Re: python 2.7 and unicode (one more time)

2014-12-02 Thread Simon Evans


Hi Peter Otten
re:

There is no assignment 

soup_atag = whatever 

but there is one to atag. The whole session should work when you omit the 
offending line 

 atag = soup_atag.a 

or insert 

soup_atag = soup 

before it. 

Python 2.7.6 (default, Nov 10 2013, 19:24:18) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
 import urllib2
 from bs4 import BeautifulSoup
 html_atag = htmlbodypTest html a tag example/p
... a href=http://www.packtpub.com'Home/a
... a href=http;//www.packtpub.com/books'.Books/a
... /body
... /html
 soup = BeautifulSoup(html_atag,'lxml')
 atag = soup.aprint(atag)
 atag = soup.a
 print(atag)
a href=http://www.packtpub.com'gt;Homelt;/agt;
lt;a href= http=
/a
 type(atag)
class 'bs4.element.Tag'
 tagname = atag.name
 print tagname
a
 atag.name = 'p'
 print (soup)
htmlbodypTest html a tag example/p
p href=http://www.packtpub.com'gt;Homelt;/agt;
lt;a href= http=
/p/body
/html
 atag.name = 'p'
 print(soup)
htmlbodypTest html a tag example/p
p href=http://www.packtpub.com'gt;Homelt;/agt;
lt;a href= http=
/p/body
/html
 atag.name = 'a'
 print(soup)
htmlbodypTest html a tag example/p
a href=http://www.packtpub.com'gt;Homelt;/agt;
lt;a href= http=
/a/body
/html
 soup_atag = soup
 atag = soup_atag.a
 print (atag['href'])
http://www.packtpub.com'Home/a
a href=
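
For illustration (not part of the original post): the href printed above
runs on into the next tag, which is what one would expect if the opening
double quote of the first href was closed with a single quote. That is only
a guess from the output. A minimal sketch with matched quotes, written for
Python 3 and using the standard library's html.parser instead of bs4/lxml
(the HrefCollector class is a hypothetical helper):

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href attribute of every <a> start tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.hrefs.append(value)


# Same markup as in the session, but with every attribute quote matched
# and "http://" spelled with a colon.
html_atag = """<html><body><p>Test html a tag example</p>
<a href="http://www.packtpub.com">Home</a>
<a href="http://www.packtpub.com/books">Books</a>
</body>
</html>"""

collector = HrefCollector()
collector.feed(html_atag)
collector.close()
hrefs = collector.hrefs
print(hrefs)
```

With the quotes matched, each href stops at the end of its own URL.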


Thank you.
Yours
Simon.

-- 
https://mail.python.org/mailman/listinfo/python-list

Re: python 2.7 and unicode (one more time)

2014-11-25 Thread Steven D'Aprano

Marko Rauhamaa wrote:

 Steven D'Aprano steve+comp.lang.pyt...@pearwood.info:
 
 Marko Rauhamaa wrote:

 Py3's byte strings are still strings, though.
 
 Hm. I don't think so. In a plain English sense, maybe, but that kind of
 usage can lead to confusion.

 Only if you are determined to confuse yourself.

 [...]

 In Python usage, string always refers to the `str` type, unless
 prefixed with byte, in which case it refers to the immutable
 byte-string type (`str` in Python 2, `bytes` in Python 3.)
 
 You are saying what I'm saying.
 
 Byte strings are *not* strings.

Of course they are. They are strings of bytes, just as the name suggests.


 Prairie dogs are not dogs. No need to call dogs domesticated dogs to
 tell them apart from prairie dogs.

But wild dogs *are* dogs, and there is a need to distinguish between wild
dogs and domesticated dogs. 

Just as there is a need to distinguish between byte strings, ASCII strings,
Latin-1 strings, Big5 strings, Unicode strings, Tron strings and cheese
strings.

I think this conversation is going nowhere, so it's probably best to end it.



-- 
Steven


Re: python 2.7 and unicode (one more time)

2014-11-25 Thread Chris Angelico

On Tue, Nov 25, 2014 at 10:56 PM, Steven D'Aprano
steve+comp.lang.pyt...@pearwood.info wrote:
 I think this conversation is going nowhere, so it's probably best to end it.

\0

ChrisA

Re: python 2.7 and unicode (one more time)

2014-11-24 Thread Steven D'Aprano

Marko Rauhamaa wrote:

 Py3's byte strings are still strings, though.
 
 Hm. I don't think so. In a plain English sense, maybe, but that kind of
 usage can lead to confusion.

Only if you are determined to confuse yourself.

People are quite capable of interpreting correctly sentences like:

"My friend Susan and I were talking about Jenny, and she said that she had
had a horrible fight with her boyfriend and was breaking up with him."

and despite the ambiguity correctly interpret who "she" and "her" refer to
each time. Compared to that, correctly understanding the mild complexity
of "string" is trivial.

In Python usage, "string" always refers to the `str` type, unless prefixed
with "byte", in which case it refers to the immutable byte-string type
(`str` in Python 2, `bytes` in Python 3).

"Unicode string" always refers to the immutable Unicode string type
(`unicode` in Python 2, `str` in Python 3).

"Text string" is more ambiguous. Some people consider the prefix to be
redundant, e.g. "text string" always refers to `str`, while others consider
it to be in opposition to "byte string", i.e. to be a synonym for "Unicode
string".

In all cases apart from an explicit "byte string", the word "string" is
always used for the native array-of-characters type delimited by plain
quotation marks, as used for error messages, user prompts, etc., regardless
of whether the implementation is an array of 8-bit bytes (as used by Python
2), or the full Unicode character set (as used by Python 3). So in
practice, provided you know which version of Python is being discussed,
there is never any genuine ambiguity when using the word "string" and no
excuse for confusion.
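
For illustration (not part of the original post), a minimal Python 3 sketch
of the two types being named. The variable names are made up for the
example:

```python
# Python 3: a plain literal is a str (Unicode code points),
# a b'' literal or the result of .encode() is bytes.
text = "caf\u00e9"
data = text.encode("utf-8")

kinds = (type(text).__name__, type(data).__name__)
lengths = (len(text), len(data))   # code points vs. encoded bytes
round_trip = data.decode("utf-8")

print(kinds, lengths, round_trip == text)
```

In Python 2 the same plain literal would be the byte-oriented `str`, and
the Unicode type would need a u'' prefix.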


-- 
Steven


Re: python 2.7 and unicode (one more time)

2014-11-24 Thread Chris Angelico

On Tue, Nov 25, 2014 at 9:56 AM, Steven D'Aprano
steve+comp.lang.pyt...@pearwood.info wrote:
 In all cases apart from an explicit byte string, the word string is
 always used for the native array-of-characters type delimited by plain
 quotation marks, as used for error messages, user prompts, etc., regardless
 whether the implementation is an array of 8-bit bytes (as used by Python
 2), or the full Unicode character set (as used by Python 3). So in
 practice, provided you know which version of Python is being discussed,
 there is never any genuine ambiguity when using the word string and no
 excuse for confusion.

And frequently, even if you're talking about Py2/Py3 cross code,
there's still no ambiguity about the word "string": it means a
default-for-the-language string. The locale.setlocale() function
expects a string as its second parameter, for instance. (And
unfortunately, flatly refuses the other sort, whichever way around
that is.)
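
For illustration (not part of the original post), a small Python 3 sketch
of that behaviour. It uses the "C" locale only because that one should
exist everywhere. The exact exception raised for a bytes argument is an
assumption, so several types are caught:

```python
import locale

# The native str type is accepted as the second parameter.
result = locale.setlocale(locale.LC_ALL, "C")

# The "other sort" of string (bytes, on Python 3) is refused.
try:
    locale.setlocale(locale.LC_ALL, b"C")
    bytes_rejected = False
except (TypeError, ValueError, locale.Error):
    bytes_rejected = True

print(result, bytes_rejected)
```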

ChrisA

Re: python 2.7 and unicode (one more time)

2014-11-24 Thread Marko Rauhamaa

Steven D'Aprano steve+comp.lang.pyt...@pearwood.info:

 Marko Rauhamaa wrote:

 Py3's byte strings are still strings, though.
 
 Hm. I don't think so. In a plain English sense, maybe, but that kind of
 usage can lead to confusion.

 Only if you are determined to confuse yourself.

 [...]

 In Python usage, string always refers to the `str` type, unless
 prefixed with byte, in which case it refers to the immutable
 byte-string type (`str` in Python 2, `bytes` in Python 3.)

You are saying what I'm saying.

Byte strings are *not* strings.

Prairie dogs are not dogs. No need to call dogs domesticated dogs to
tell them apart from prairie dogs.


Marko



Re: python 2.7 and unicode (one more time)

2014-11-23 Thread Chris Angelico

On Mon, Nov 24, 2014 at 3:33 AM, Dennis Lee Bieber
wlfr...@ix.netcom.com wrote:
 On Sat, 22 Nov 2014 20:52:37 -0500, random...@fastmail.us declaimed the
 following:

On Sat, Nov 22, 2014, at 18:38, Mark Lawrence wrote:
 ...
 That is a standard Windows build. He is again conflating problems with
 using the Windows command line for a given code page with the FSR.

The thing is, with a truetype font selected, a correctly written win32
console program should be able to print any character without caring

 Why would that be possible? Many truetype fonts only supply glyphs for
 single-byte encodings (ISO-Latin-1, for example -- pop up the Windows
 character map utility and see what some of the font files contain.

A program should be able to print those characters even if they all
look identical. Chances are you can copy and paste them into something
else. But yes, finding a suitable font that covers the whole Unicode
range is *hard*. I've struggled with this one with a few programs (and
I still haven't managed to get VLC to satisfactorily display subtitles
that include Chinese characters).

ChrisA

Re: python 2.7 and unicode (one more time)

2014-11-23 Thread random832

On Sun, Nov 23, 2014, at 11:33, Dennis Lee Bieber wrote:
   Why would that be possible? Many truetype fonts only supply glyphs for
 single-byte encodings (ISO-Latin-1, for example -- pop up the Windows
 character map utility and see what some of the font files contain.

With a bitmap font selected, the characters will be immediately replaced
with characters present in the font's codepage, and will copy to
clipboard as such.

With a truetype font (Lucida Console or Consolas) selected, the
characters will be displayed as replacement glyphs (box with a question
mark in it) if not present in the font, but *will still copy to the
clipboard as the original code point* (which you might notice is where
we started, with someone claiming success by being able to do so with
codepage 65001 selected). And in any case, all characters that *are* in
the font will work and display correctly, rather than only those in the
OEM codepage.

   Heck -- on my current machine, the True Type fonts are all old
 third-party items. All the standard fonts are now Open Type.

The win32 console's configuration UI refers to opentype fonts as
truetype. Opentype fonts can use either truetype or type 1 as the
underlying format, and all opentype fonts supplied with windows use
truetype. You are being excessively pedantic in objecting to my use of
the term truetype.

Re: python 2.7 and unicode (one more time)

2014-11-23 Thread Dave Angel


On 11/23/2014 01:13 PM, random...@fastmail.us wrote:

On Sun, Nov 23, 2014, at 11:33, Dennis Lee Bieber wrote:

Why would that be possible? Many truetype fonts only supply glyphs for
single-byte encodings (ISO-Latin-1, for example -- pop up the Windows
character map utility and see what some of the font files contain.


With a bitmap font selected, the characters will be immediately replaced
with characters present in the font's codepage, and will copy to
clipboard as such.


I didn't realize Windows shell (DOS box) had that bug.  Course I don't 
use Windows much the last few years.


It's one thing to not display it properly.  It's quite another to supply 
faulty data to the clipboard.  Especially since the Windows clipboard 
has a separate Unicode type available.


--
DaveA

Re: python 2.7 and unicode (one more time)

2014-11-23 Thread Chris Angelico

On Mon, Nov 24, 2014 at 7:31 AM, Dave Angel d...@davea.name wrote:
 On 11/23/2014 01:13 PM, random...@fastmail.us wrote:

 On Sun, Nov 23, 2014, at 11:33, Dennis Lee Bieber wrote:

 Why would that be possible? Many truetype fonts only supply
 glyphs for
 single-byte encodings (ISO-Latin-1, for example -- pop up the Windows
 character map utility and see what some of the font files contain.


 With a bitmap font selected, the characters will be immediately replaced
 with characters present in the font's codepage, and will copy to
 clipboard as such.


 I didn't realize Windows shell (DOS box) had that bug.  Course I don't use
 Windows much the last few years.

Likewise. I've been accustomed to copying and pasting unrecognized
characters (one of the easiest solutions is to paste them into a
Python console - ord() for one character, or a Py2 repr() for multiple
- to quickly see what the codepoints are), relying on the clipboard
getting the exact same sequence that was printed by the application.
Thanks, Windows, just what I always wanted to hear.
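
For illustration (not part of the original post), a Python 3 sketch of that
inspection trick. In Python 3, ascii() plays the role that repr() played in
Python 2. The sample string is made up:

```python
# Text as it might arrive from the clipboard.
pasted = "caf\u00e9 \u20ac"

# ord() gives the code point of a single character ...
code_points = ["U+%04X" % ord(ch) for ch in pasted]

# ... and ascii() escapes every non-ASCII character in a whole string.
escaped = ascii(pasted)

print(code_points)
print(escaped)
```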

ChrisA

Re: python 2.7 and unicode (one more time)

2014-11-23 Thread Gregory Ewing


Marko Rauhamaa wrote:

"Unicode strings" is not wrong but the technical emphasis on Unicode is as
strange as a "tire car" or "rectangular door" when "car" and "door" are
what you usually mean.


The reason Unicode gets emphasised so much is that
until relatively recently, it *wasn't* what string
usually meant in Python.

When Python 3 has been around for as long as Python
2 was, things may change.

--
Greg

Re: python 2.7 and unicode (one more time)

2014-11-23 Thread Chris Angelico

On Mon, Nov 24, 2014 at 9:51 AM, Gregory Ewing
greg.ew...@canterbury.ac.nz wrote:
 Marko Rauhamaa wrote:

 Unicode strings is not wrong but the technical emphasis on Unicode is as
 strange as a tire car or rectangular door when car and door are
 what you usually mean.


 The reason Unicode gets emphasised so much is that
 until relatively recently, it *wasn't* what string
 usually meant in Python.

 When Python 3 has been around for as long as Python
 2 was, things may change.

I doubt it; the bytes() type is sufficiently stringy to require the
distinction to still be made. PEP 461 makes it clear that byte strings
are not blobs of opaque data, but are very definitely ASCII-compatible
objects, for the benefit of boundary code.
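
For illustration (not part of the original post), a short sketch of the
"stringy" side of bytes. The %-formatting line needs Python 3.5 or later,
where PEP 461 landed. The header text is a made-up example:

```python
# %-interpolation on bytes (PEP 461, Python 3.5+).
header = b"Content-Length: %d\r\n" % 42

# bytes also carries ASCII-oriented string methods such as upper().
request_line = b"get /index.html".upper()

print(header, request_line)
```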

ChrisA

Re: python 2.7 and unicode (one more time)

2014-11-23 Thread random832

On Sun, Nov 23, 2014, at 15:31, Dave Angel wrote:
 I didn't realize Windows shell (DOS box) had that bug.  Course I don't 
 use Windows much the last few years.
 
 it's one thing to not display it properly.  It's quite another to supply 
 faulty data to the clipboard.  Especially since the Windows clipboard 
 has a separate Unicode type available.

It's because console bitmap fonts almost always (always?) only have one
codepage's worth of characters, and it's considered better to display A
for U+0100 than a blank space, and the clipboard has always been a bit
of an afterthought for the windows console. Meanwhile, a truetype font
is considered likely to have real glyphs for most characters a user
would want to display, so no conversion is done. And there's no font
rendering routine for bitmap fonts that will allow for dynamic
substitution of glyphs, so it becomes a real A (or whatever) in the
console buffer itself - this isn't a conversion done at clipboard-copy
time.

Re: python 2.7 and unicode (one more time)

2014-11-23 Thread Marko Rauhamaa

Gregory Ewing greg.ew...@canterbury.ac.nz:
 Marko Rauhamaa wrote:
 Unicode strings is not wrong but the technical emphasis on Unicode is as
 strange as a tire car or rectangular door when car and door are
 what you usually mean.

 The reason Unicode gets emphasised so much is that until relatively
 recently, it *wasn't* what string usually meant in Python.

 When Python 3 has been around for as long as Python 2 was, things may
 change.

Yes, people call strings "Unicode strings" because Python2 *did have*
unicode strings separate from regular strings:

    Python2            Python3
    ----------------------------------------
    string             bytes (byte string)
    unicode string     string


In Python2 days, Unicode was a fancy, exotic datatype for the
connoisseurs. The rest used strings. Python3 supposedly elevates Unicode
to boring normalcy. Now it's bytes that have fallen into (unmerited)
disfavor.

But old habits die hard; you call cars "automobile cars" instead of
"cars" since, after all, cars were always pulled by horses...


Marko

PS Maybe interestingly, Guile went through an analogous transition. As
of Guile 2.0,

  a character is anything in the Unicode Character Database.
  [...]
  Strings are fixed-length sequences of characters.
  [...]
  A bytevector is a raw bit string.

  URL: https://www.gnu.org/software/guile/manual/html_node/index.html

However, Guile 1.8 still had:

  The Guile implementation of character sets currently deals only with
  8-bit characters.

  URL: https://www.gnu.org/software/guile/docs/docs-1.8/guile-ref/index.html

and there were no bytevectors.

Re: python 2.7 and unicode (one more time)

2014-11-23 Thread Chris Angelico

On Mon, Nov 24, 2014 at 5:57 PM, Marko Rauhamaa ma...@pacujo.net wrote:
 Yes, people call strings "Unicode strings" because Python2 *did have*
 unicode strings separate from regular strings:

     Python2            Python3
     ----------------------------------------
     string             bytes (byte string)
     unicode string     string


 In Python2 days, Unicode was a fancy, exotic datatype for the
 connoisseurs. The rest used strings. Python3 supposedly elevates Unicode
 to boring normalcy. Now it's bytes that have fallen into (unmerited)
 disfavor.

Py3's byte strings are still strings, though. People don't use
bytearray for everything.

ChrisA

Re: python 2.7 and unicode (one more time)

2014-11-23 Thread Marko Rauhamaa

Chris Angelico ros...@gmail.com:

 Py3's byte strings are still strings, though.

Hm. I don't think so. In a plain English sense, maybe, but that kind of
usage can lead to confusion.

For example,

   A subscription selects an item of a sequence (string, tuple or list)
   or mapping (dictionary) object:

   subscription ::=  primary [ expression_list ]

   [...]

   A string’s items are characters. A character is not a separate data
   type but a string of exactly one character.

   URL: https://docs.python.org/3/reference/expressions.html#subscriptions


The text is probably a bit buggy since it skates over bytes and byte
arrays listed as sequences (by URL:
https://docs.python.org/3/reference/datamodel.html). However, your
Python3 implementation would fail if it interpreted bytes objects to be
strings in the above paragraph:

   >>> "abc"[1]
   'b'
   >>> b'abc'[1]
   98

The subscription of a *string* evaluates to a *string*. The subscription
of a *bytes* object evaluates to a *number*.
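
For illustration (not part of the original post), the same comparison as a
Python 3 script, with slicing added. Slicing keeps the bytes type even
though indexing does not:

```python
text_item = "abc"[1]      # indexing a str gives a one-character str
byte_item = b"abc"[1]     # indexing bytes gives an int
byte_slice = b"abc"[1:2]  # slicing bytes gives bytes again

print(repr(text_item), repr(byte_item), repr(byte_slice))
```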


Marko

Re: python 2.7 and unicode (one more time)

2014-11-22 Thread Steven D'Aprano

Marko Rauhamaa wrote:

 Steven D'Aprano steve+comp.lang.pyt...@pearwood.info:
 
 In Python, we have Unicode strings and byte strings.
 
 No, you don't. You have strings and bytes:

Python has strings of Unicode code points, a.k.a. "Unicode strings",
or "text strings", and strings of bytes, a.k.a. "byte strings". These are
the plain English descriptive names of the types str and bytes. 


   Textual data in Python is handled with str objects, or strings.
   Strings are immutable sequences of Unicode code points. String
   literals are written in a variety of ways: [...]

Hence, Unicode string.


   URL: https://docs.python.org/3/library/stdtypes.html#text-sequence-type-str
 
   The core built-in types for manipulating binary data are bytes and
   bytearray.

Which are strings of bytes.


   URL: https://docs.python.org/3/library/stdtypes.html#binary-sequence-types-bytes-bytearray-memoryview
 
 
 Equivalently, I wouldn't mind character strings vs byte strings.

Unicode strings are not strings of characters, except informally. Some code
points represent non-characters:

http://www.unicode.org/faq/private_use.html#nonchar1

They are strings of Unicode code points, but "code point string" is firstly
an inelegant and ugly phrase, and secondly ambiguous. What sort of code
points? Baudot codes? ASCII codes? Big5 codes? Tron codes? No, none of the
above, they are *Unicode* code points.
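
For illustration (not part of the original post), a Python 3 sketch of the
gap between code points and characters. The sample values are chosen for
the example:

```python
import unicodedata

# One user-perceived character, two code points: "e" + COMBINING ACUTE.
decomposed = "e\u0301"
composed = unicodedata.normalize("NFC", decomposed)

# U+FFFF is a noncharacter, yet a str holds it like any other code point.
nonchar = "\uffff"
nonchar_category = unicodedata.category(nonchar)

print(len(decomposed), len(composed), nonchar_category)
```

len() counts code points, so the decomposed and composed forms differ in
length even though they render the same.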

You haven't given any good reason for objecting to calling Unicode strings
by what they are. Maybe you think that it is an implementation detail, and
that some version of Python might suddenly and without warning change to
only supporting KOI8-R strings or GB2312 strings? If so, you are badly
mistaken. The fact that Python strings are Unicode is not an implementation
detail, it is part of the language semantics.


 Unicode strings is not wrong but the technical emphasis on Unicode is as
 strange as a tire car or rectangular door when car and door are
 what you usually mean.

"Tire car" makes no sense. "Rectangular door" makes perfect sense, and in a
world where there are dozens of legacy non-rectangular doors, it would be
very sensible to specify the kind of door. Just as we specify sliding door,
glass door, security door, fire door, flyscreen wire door, and so on.



-- 
Steven


Re: python 2.7 and unicode (one more time)

2014-11-22 Thread Chris Angelico

On Sun, Nov 23, 2014 at 12:50 AM, Steven D'Aprano
steve+comp.lang.pyt...@pearwood.info wrote:
 Tire car makes no sense. Rectangular door makes perfect sense, and in a
 world where there are dozens of legacy non-rectangular doors, it would be
 very sensible to specify the kind of door. Just as we specify sliding door,
 glass door, security door, fire door, flyscreen wire door, and so on.

Not just legacy - scifi often has non-rectangular doors. (And they're
often HORRIBLY impractical. I think the rectangular door is here to
stay.) But English is a strange beast. A glass door is made of
glass... a flyscreen wire door is made of (at least, has a significant
component of) flyscreen, but a fire door isn't made of fire...

ChrisA

Re: python 2.7 and unicode (one more time)

2014-11-22 Thread Marko Rauhamaa

Steven D'Aprano steve+comp.lang.pyt...@pearwood.info:

 You haven't given any good reason for objecting to calling Unicode
 strings by what they are. Maybe you think that it is an implementation
 detail, and that some version of Python might suddenly and without
 warning change to only supporting KOI8-R strings or GB2312 strings? If
 so, you are badly mistaken. The fact that Python strings are Unicode
 is not an implementation detail, it is part of the language semantics.

To me, repeating the word Unicode everywhere is giving the (in and of
itself impressive) standard too primary a status. While understanding
how Unicode, IEEE-754, 2's complement, mark-and-sweep etc work is very
useful and occasionally can be taken explicit advantage of, those really
are mundane techniques to implement abstractions.

Python's strings exist (primarily) so you can express utterances in a
human language, aka plain text. They don't exist to express Unicode code
points. That would be putting the cart before the horse.

 Rectangular door makes perfect sense, and in a world where there are
 dozens of legacy non-rectangular doors, it would be very sensible to
 specify the kind of door.

It makes sense, and yet, I've never heard anyone talk about rectangular
doors even though I use numerous doors every day. Why is it, then, that
people feel the constant need to add the Unicode epithet to Python's
strings, which -- according to its own specification -- are just
strings?


Marko

Re: python 2.7 and unicode (one more time)

2014-11-22 Thread Roy Smith

In article 87y4r348uf@elektro.pacujo.net,
 Marko Rauhamaa ma...@pacujo.net wrote:

 Steven D'Aprano steve+comp.lang.pyt...@pearwood.info:
 
  You haven't given any good reason for objecting to calling Unicode
  strings by what they are. Maybe you think that it is an implementation
  detail, and that some version of Python might suddenly and without
  warning change to only supporting KOI8-R strings or GB2312 strings? If
  so, you are badly mistaken. The fact that Python strings are Unicode
  is not an implementation detail, it is part of the language semantics.
 
 To me, repeating the word Unicode everywhere is giving the (in and of
 itself impressive) standard too primary a status. While understanding
 how Unicode, IEEE-754, 2's complement, mark-and-sweep etc work is very
 useful and occasionally can be taken explicit advantage of, those really
 are mundane techniques to implement abstractions.
 
 Python's strings exist (primarily) so you can express utterances in a
 human language, aka plain text. They don't exist to express Unicode code
 points. That would be putting the cart before the horse.
 
  Rectangular door makes perfect sense, and in a world where there are
  dozens of legacy non-rectangular doors, it would be very sensible to
  specify the kind of door.
 
 It makes sense, and yet, I've never heard anyone talk about rectangular
 doors even though I use numerous doors every day. Why is it, then, that
 people feel the constant need to add the Unicode epithet to Python's
 strings, which -- according to its own specification -- are just
 strings?
 
 
 Marko

There's an old joke to the effect that the fields of study which are 
confident that they're really doing science (i.e. chemistry, biology, 
physics, astronomy, etc) don't put the word "science" in their names.  
It's only the fields of study that are less confident about their status 
as sciences (computer science, behavioral science, political science, 
etc) that feel the need to explicitly say "science".  As if repeating it 
enough times makes it true.  I wonder if something of the same thing 
applies here?  <ducking and running>

Somewhat more seriously, the IEEE-754 point is quite apropos.  Back when 
754 first came out, there were lots of different floating point 
implementations.  Machines that used 754 touted it in their sales 
literature and mentioned it all over their documentation.  These days, 
754 is so ubiquitous, nobody even thinks to mention it, in the same way 
nobody bothers to mention 2's complement integers.  I suspect that some 
day, the same thing will happen with Unicode.  For that matter, we will 
eventually get to the point where when people say, "just plain text", 
they will mean Unicode, in the same way that "just plain text" today 
really means ASCII (and the text/plain MIME type will become a 
historical curiosity).

Re: python 2.7 and unicode (one more time)

2014-11-22 Thread Marko Rauhamaa

Roy Smith r...@panix.com:

 For that matter, we will eventually get to the point where when people
 say, just plain text, they will mean Unicode, in the same way that
 just plain text today really means ASCII (and the text/plain MIME
 type will become a historical curiosity).

MIME has:

   Content-Type: text/plain; charset=UTF-8

(even though UTF-8 isn't a character set but a content encoding).
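
For illustration (not part of the original post), a Python 3 sketch using
the standard library's email package to build a text/plain part and read
back the charset parameter it assigns. The message text is made up:

```python
from email.message import EmailMessage

msg = EmailMessage()
# set_content() on a str produces a text/plain part with a charset
# parameter on the Content-Type header.
msg.set_content("na\u00efve caf\u00e9")

content_type = msg.get_content_type()
charset = msg.get_content_charset()

# The encoding step itself: one code point, two UTF-8 bytes.
encoded = "\u00e9".encode("utf-8")

print(msg["Content-Type"], encoded)
```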


Marko

Re: python 2.7 and unicode (one more time)

2014-11-22 Thread Rustom Mody

On Saturday, November 22, 2014 8:14:15 PM UTC+5:30, Roy Smith wrote:
  Marko Rauhamaa wrote:
 
  Steven D'Aprano:
  
   You haven't given any good reason for objecting to calling Unicode
   strings by what they are. Maybe you think that it is an implementation
   detail, and that some version of Python might suddenly and without
   warning change to only supporting KOI8-R strings or GB2312 strings? If
   so, you are badly mistaken. The fact that Python strings are Unicode
   is not an implementation detail, it is part of the language semantics.
  
  To me, repeating the word Unicode everywhere is giving the (in and of
  itself impressive) standard too primary a status. While understanding
  how Unicode, IEEE-754, 2's complement, mark-and-sweep etc work is very
  useful and occasionally can be taken explicit advantage of, those really
  are mundane techniques to implement abstractions.
  
  Python's strings exist (primarily) so you can express utterances in a
  human language, aka plain text. They don't exist to express Unicode code
  points. That would be putting the cart before the horse.
  
   Rectangular door makes perfect sense, and in a world where there are
   dozens of legacy non-rectangular doors, it would be very sensible to
   specify the kind of door.
  
  It makes sense, and yet, I've never heard anyone talk about rectangular
  doors even though I use numerous doors every day. Why is it, then, that
  people feel the constant need to add the Unicode epithet to Python's
  strings, which -- according to its own specification -- are just
  strings?
  
  
  Marko
 
 There's a old joke to the effect that the fields of study which are 
 confident that they're really doing science (i.e. chemistry, biology, 
 physics, astronomy, etc) don't put the word science in their names.  
 It's only the fields of study that are less confident about their status 
 as sciences (computer science, behavioral science, political science, 
 etc) that feel the need to explicitly say science.  As if repeating it 
 enough times makes it true.  I wonder if something of the same thing 
 applies here?  ducking and running
 
 Somewhat more seriously, the IEEE-754 point is quite apropos.  Back when 
 754 first came out, there were lots of different floating point 
 implementations.  Machines that used 754 touted it in their sales 
 literature and mentioned it all over their documentation.  These days, 
 754 is so ubiquitous, nobody even thinks to mention it, in the same way 
 nobody bothers to mention 2's complement integers.  I suspect that some 
 day, the same thing will happen with Unicode.  For that matter, we will 
 eventually get to the point where when people say, just plain text, 
 they will mean Unicode, in the same way that just plain text today 
 really means ASCII (and the text/plain MIME type will become a 
 historical curiosity).

Yes this was my point also -- encodings in general and unicode in
particular are a mess (as of 2014).  Maybe in a few years the dust 
will settle.  Then saying 'unicode' will become redundant.
But until then, while we have a rather leaky abstraction, having
sealing liquid on the hands is preferable to sewage in the house.

Re: python 2.7 and unicode (one more time)

2014-11-22 Thread Marko Rauhamaa

wxjmfa...@gmail.com:

 - By chance, I found on the web a German py dev who was commenting and
 he had not an updated DUDEN (a German dictionnary).

That... leaves me utterly speachless!


Marko

Re: python 2.7 and unicode (one more time)

2014-11-22 Thread Mark Lawrence


On 22/11/2014 17:49, Marko Rauhamaa wrote:

wxjmfa...@gmail.com:


- By chance, I found on the web a German py dev who was commenting and
he had not an updated DUDEN (a German dictionnary).


That... leaves me utterly speachless!


Marko



Please don't feed him.  Your average troll is bad enough but he really 
takes the biscuit.


--
My fellow Pythonistas, ask not what our language can do for you, ask
what you can do for our language.

Mark Lawrence


Re: python 2.7 and unicode (one more time)

2014-11-22 Thread Chris Angelico

On Sun, Nov 23, 2014 at 5:17 AM, Mark Lawrence breamore...@yahoo.co.uk wrote:
 Please don't feed him.  Your average troll is bad enough but he really takes
 the biscuit.

... someone was feeding him biscuits?

ChrisA

Re: python 2.7 and unicode (one more time)

2014-11-22 Thread Mark Lawrence


On 22/11/2014 20:17, Chris Angelico wrote:

On Sun, Nov 23, 2014 at 5:17 AM, Mark Lawrence breamore...@yahoo.co.uk wrote:

Please don't feed him.  Your average troll is bad enough but he really takes
the biscuit.


... someone was feeding him biscuits?

ChrisA



Surely it's better than feeding him unicode?

As I needed cheering up I ventured over to gg and wasn't disappointed 
reading his latest rubbish. My favourite: "find thousand and one ways to 
make Python crashing or failing." But I don't recall a single bug report 
in the last two years from anybody regarding problems with the FSR, or 
have I missed something?


--
My fellow Pythonistas, ask not what our language can do for you, ask
what you can do for our language.

Mark Lawrence


Re: python 2.7 and unicode (one more time)

2014-11-22 Thread Chris Angelico

On Sun, Nov 23, 2014 at 9:04 AM, Mark Lawrence breamore...@yahoo.co.uk wrote:
 My favourite find thousand and one ways to make Python crashing or
 failing. but I don't recall a single bug report in the last two years from
 anybody regarding problems with the FSR, or have I missed something?

What you've missed is the grammar of the sentence you've (partially)
quoted. Clearly he is seeking to make Python, and he is crashing or
failing. My advice to him: Stop trying to build complex software while
in command of a car.

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: python 2.7 and unicode (one more time)

2014-11-22 Thread Mark Lawrence


On 22/11/2014 22:31, Chris Angelico wrote:

On Sun, Nov 23, 2014 at 9:04 AM, Mark Lawrence breamore...@yahoo.co.uk wrote:

My favourite "find thousand and one ways to make Python crashing or
failing." but I don't recall a single bug report in the last two years from
anybody regarding problems with the FSR, or have I missed something?


What you've missed is the grammar of the sentence you've (partially)
quoted. Clearly he is seeking to make Python, and he is crashing or
failing. My advice to him: Stop trying to build complex software while
in command of a car.

ChrisA




What?  The entire message follows.

quote
I think you are not understanding the point very well.

Py32 and Qt derivative + plenty of dirty tricks.
(It will probably not be rendered correctly.)

Write something like this (an interactive interpreter)
in Py32 and Py33 and see what happens:

>>> print(999)
999
>>> sys.version
'3.2.5 (default, May 15 2013, 23:06:03) [MSC v.1500 32 bit (Intel)]'
>>> # note the emoji and the private use area (plane 15)
>>> a = 'abc\u00e9\u0153\u20ac\u1e9e\U0001f300\udb80\udc00z'
>>> print(a)
abcéœ€ẞz


Note: it can be cut/copied/pasted with a MS product.

jmf

PS I have to recognized, I'm slowly getting tired to
find thousand and one ways to make Python crashing
or failing.
/quote

That is a standard Windows build. He is again conflating problems with 
using the Windows command line for a given code page with the FSR.


--
My fellow Pythonistas, ask not what our language can do for you, ask
what you can do for our language.

Mark Lawrence

--
https://mail.python.org/mailman/listinfo/python-list

Re: python 2.7 and unicode (one more time)

2014-11-22 Thread random832

On Fri, Nov 21, 2014, at 23:38, Steven D'Aprano wrote:
 I really don't understand what bothers you about this. In Python, we have
 Unicode strings and byte strings. In computing in general, strings can
 consist of Unicode characters, ASCII characters, Tron characters, EBCDIC
 characters, ISO-8859-7 characters, and literally dozens of others. It
 boggles my mind that you are so opposed to being explicit about what sort
 of string we are dealing with.

I think he means that it should be implementation-defined with an API
that does not allow programs to make assumptions about the encoding,
like C. To allow for implementations that use a different character set.
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: python 2.7 and unicode (one more time)

2014-11-22 Thread random832

On Sat, Nov 22, 2014, at 18:38, Mark Lawrence wrote:
 ...
 That is a standard Windows build. He is again conflating problems with 
 using the Windows command line for a given code page with the FSR.

The thing is, with a truetype font selected, a correctly written win32
console program should be able to print any character without caring
about codepages (via use of WriteConsoleW instead of WriteFile). You
cannot rely on having the codepage set to 65001, especially since 65001
isn't actually a fully supported codepage.

In my opinion it is a deficiency in the win32 support, rather than
unicode support (and certainly nothing to do with the FSR), but it _is_
a deficiency.
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: python 2.7 and unicode (one more time)

2014-11-22 Thread Chris Angelico

On Sun, Nov 23, 2014 at 12:52 PM,  random...@fastmail.us wrote:
 On Sat, Nov 22, 2014, at 18:38, Mark Lawrence wrote:
 ...
 That is a standard Windows build. He is again conflating problems with
 using the Windows command line for a given code page with the FSR.

 The thing is, with a truetype font selected, a correctly written win32
 console program should be able to print any character without caring
 about codepages (via use of WriteConsoleW instead of WriteFile). You
 cannot rely on having the codepage set to 65001, especially since 65001
 isn't actually a fully supported codepage.

Is that true? Does WriteConsoleW support every Unicode character? It's
not obvious from the docs whether it uses UCS-2 or UTF-16 (or maybe
something else).

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: python 2.7 and unicode (one more time)

2014-11-22 Thread random832

On Sat, Nov 22, 2014, at 21:11, Chris Angelico wrote:
 Is that true? Does WriteConsoleW support every Unicode character? It's
 not obvious from the docs whether it uses UCS-2 or UTF-16 (or maybe
 something else).

I was defining "every unicode character" loosely. There are certainly
display problems (there are display problems with wide characters on
non-CJK windows versions, too), but if you write a surrogate pair,
you'll get something that can copy to the clipboard as a surrogate pair,
and get the same thing that writing a non-BMP UTF-8 character with
codepage 65001 will give you. And you certainly won't get an error.
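
As a minimal Python 3 sketch of what "a surrogate pair" means here (an illustration only; the variable names are made up and nothing below calls the Windows console API), a code point outside the BMP becomes two 16-bit code units in UTF-16:

```python
import struct

ch = '\U0001F300'  # the emoji from the example string; outside the BMP

# Encode as little-endian UTF-16 (no BOM) and split into 16-bit code units.
units = struct.unpack('<2H', ch.encode('utf-16-le'))
print([hex(u) for u in units])

# The same pair computed by hand from the code point.
offset = ord(ch) - 0x10000
high = 0xD800 + (offset >> 10)    # high (lead) surrogate
low = 0xDC00 + (offset & 0x3FF)   # low (trail) surrogate
```

An API that only understood UCS-2 would see these as two unrelated code units, which is the distinction being asked about.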
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: python 2.7 and unicode (one more time)

2014-11-22 Thread Steven D'Aprano

random...@fastmail.us wrote:

 On Fri, Nov 21, 2014, at 23:38, Steven D'Aprano wrote:
 I really don't understand what bothers you about this. In Python, we have
 Unicode strings and byte strings. In computing in general, strings can
 consist of Unicode characters, ASCII characters, Tron characters, EBCDIC
 characters, ISO-8859-7 characters, and literally dozens of others. It
 boggles my mind that you are so opposed to being explicit about what sort
 of string we are dealing with.
 
 I think he means that it should be implementation-defined with an API
 that does not allow programs to make assumptions about the encoding,
 like C. To allow for implementations that use a different character set.

Python is not C, and doesn't make every second thing undefined behaviour.

If Python treated the character set as an implementation detail, the
programmer would have no way of knowing whether

s = u"ö"

is legal or not, since you cannot know whether or not ö is a supported
character in the running Python. It might work on your system, and fail for
other people. That is worse than the old distinction between narrow
and wide builds. It would be a lazy and stupid design, and especially
stupid since there really is no good alternative to Unicode today. ASCII is
not even sufficient for American English, the whole Windows code page idea
is a horrible mess, none of the legacy encodings are suitable for more than
a tiny fraction of the world.


-- 
Steven

-- 
https://mail.python.org/mailman/listinfo/python-list

Re: python 2.7 and unicode (one more time)

2014-11-22 Thread Chris Angelico

On Sun, Nov 23, 2014 at 5:17 PM, Steven D'Aprano
steve+comp.lang.pyt...@pearwood.info wrote:
 If Python treated the character set as an implementation detail, the
 programmer would have no way of knowing whether

 s = u"ö"

 is legal or not, since you cannot know whether or not ö is a supported
 character in the running Python. It might work on your system, and fail for
 other people. That is worse than the old distinction between narrow
 and wide builds. It would be a lazy and stupid design, and especially
 stupid since there really is no good alternative to Unicode today. ASCII is
 not even sufficient for American English, the whole Windows code page idea
 is a horrible mess, none of the legacy encodings are suitable for more than
 a tiny fraction of the world.

(Code pages aren't a Windows concept, of course, though I guess that's
the main place where they're found on PCs today.)

The only trouble with enforcing Unicode is Japanese encodings and the
whole Han unification debate. Ultimately, you have to pick a side: are
you siding with those who say there are fewer characters with multiple
forms, or with those who say there are more distinct characters? If
the former, go with Unicode. If the latter, be prepared to do heaps of
work yourself, and probably be stuck with supporting only Japanese,
because encodings like Shift-JIS aren't going to be able to represent
Scandinavian text.

Me, I'm siding with Unicode. The politicking of Han unification
doesn't interest me, so I'm happy to accept a position that says that
they're all the same character, just as the Roman letter A can be used
in English, Italian, German, Swedish, etc, etc, etc (maybe with some
combining characters for diacriticals). That gives me access to all
the world's languages with a single character set and some trustworthy
encodings. I think it's a fine trade-off: philosophy I don't care
about versus correctness in my code.

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: python 2.7 and unicode (one more time)

2014-11-21 Thread Marko Rauhamaa

Chris Angelico ros...@gmail.com:

 Then you need to read more about Unicode. The *codepoint* for the
 letter 'A' is 65. That is not Unicode, that is one part of the Unicode
 spec.

I don't think Python users need to know anything more about Unicode than
they need to know about IEEE-754.

How many bits are reserved for the mantissa? I don't remember and I
don't see why I should care.


Marko
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: python 2.7 and unicode (one more time)

2014-11-21 Thread Chris Angelico

On Fri, Nov 21, 2014 at 7:16 PM, Marko Rauhamaa ma...@pacujo.net wrote:
 Chris Angelico ros...@gmail.com:

 Then you need to read more about Unicode. The *codepoint* for the
 letter 'A' is 65. That is not Unicode, that is one part of the Unicode
 spec.

 I don't think Python users need to know anything more about Unicode than
 they need to know about IEEE-754.

 How many bits are reserved for the mantissa? I don't remember and I
 don't see why I should care.

At what point can a Python float no longer represent every integer?
That's why you should care.
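
A quick sketch of the answer, assuming Python floats are IEEE-754 doubles with a 53-bit significand (true for CPython on common platforms, but an assumption all the same; the variable names are illustrative):

```python
limit = 2 ** 53  # first point where consecutive integers stop being distinct

# Up to the limit, integers survive the round trip through float exactly.
exact_at_limit = float(limit) == limit
distinct_below = float(limit - 1) != float(limit)

# One past the limit, the integer rounds to a neighbouring float.
collides_above = float(limit + 1) == float(limit)
```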

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: python 2.7 and unicode (one more time)

2014-11-21 Thread Steven D'Aprano

Chris Angelico wrote:

 On Fri, Nov 21, 2014 at 11:32 AM, Steven D'Aprano
 steve+comp.lang.pyt...@pearwood.info wrote:
 (E.g. there are millions of existing files across the world containing
 text which use legacy encodings that are not compatible with Unicode.)
 
 Not compatible with Unicode? There aren't many character sets out
 there that include characters not in Unicode - that was the whole
 point. Of course, there are plenty of files in unspecified eight-bit
 encodings, so you may have a problem with reliable decoding - but if
 you know what the encoding is, you ought to be able to represent each
 character in Unicode.

What I meant was that for some encodings -- namely ASCII and Latin-1 -- the
ordinals are exactly equivalent to Unicode, that is:

# Python 3
for i in range(128):
    assert chr(i).encode('ASCII') == bytes([i])

for i in range(256):
    assert chr(i).encode('Latin-1') == bytes([i])


That's not quite as significant as I thought, though. What is significant is
that a pure ASCII file on disk can be read by a program assuming UTF-8:

for i in range(128):
    assert chr(i).encode('UTF-8') == bytes([i])


although the same is not the case for Latin-1 encoded files.
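
To illustrate the Latin-1 half of that statement, here is a small Python 3 sketch (the sample word is arbitrary): the same text yields different bytes under the two encodings, and the Latin-1 bytes are not valid UTF-8.

```python
text = 'caf\xe9'  # "café", with one non-ASCII character

latin1_bytes = text.encode('latin-1')  # one byte per character
utf8_bytes = text.encode('utf-8')      # the accented letter takes two bytes

# Pure ASCII bytes decode identically under UTF-8.
ascii_ok = 'cafe'.encode('ascii').decode('utf-8') == 'cafe'

# Latin-1 bytes with a high byte are rejected by a UTF-8 decoder.
try:
    latin1_bytes.decode('utf-8')
    latin1_reads_as_utf8 = True
except UnicodeDecodeError:
    latin1_reads_as_utf8 = False
```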


 Not compatible with any of the UTFs, that's different. Plenty of that
 in the world.
 
 You are certainly correct that in its full generality, text is much
 more than just a string of code points. Unicode strings are a primitive
 data type. A powerful and sophisticated text processing application may
 even find Python strings too primitive, possibly needing something like
 ropes of graphemes rather than strings of code points.
 
 That's probably more an efficiency point, though. It should be
 possible to do a perfect two-way translation between your grapheme
 rope and a Python string; otherwise, you'll have great difficulty
 saving your file to the disk (which will normally involve representing
 the text in Unicode, then encoding that to bytes).

Well, yes. My point, agreeing with Marko, is that any time you want to do
something even vaguely related to human-readable text, code points are
not enough. For example, if I give a string containing the following two
code points in this order:

LATIN SMALL LETTER E
COMBINING CIRCUMFLEX ACCENT

then my application should treat that as a single character and display it
as:

LATIN SMALL LETTER E WITH CIRCUMFLEX

which looks like this: ê

rather than two distinct characters eˆ

Now, that specific example is a no-brainer, because the Unicode
normalization routines will handle the conversion. But not every
combination of accented characters has a canonical combined form. What
about something like this?

'w\N{COMBINING CIRCUMFLEX ACCENT}\N{COMBINING OGONEK}\N{COMBINING CARON}'

If I insert a character into my string, I want to be able to insert before
the w or after the caron, but not in the middle of those three code points.
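
A rough sketch of that idea in Python 3, under a loud simplifying assumption: a "character" is taken to be a base code point plus any following code points with a non-zero combining class. Real grapheme segmentation (Unicode UAX #29) has more rules than this, and the helper name `clusters` is made up for the example.

```python
import unicodedata

def clusters(text):
    """Group each base code point with the combining marks that follow it."""
    groups = []
    for ch in text:
        if groups and unicodedata.combining(ch):
            groups[-1] += ch      # attach the mark to the preceding base
        else:
            groups.append(ch)     # start a new cluster
    return groups

s = 'aw\N{COMBINING CIRCUMFLEX ACCENT}\N{COMBINING OGONEK}\N{COMBINING CARON}b'

parts = clusters(s)
parts.insert(2, 'X')              # insert after the whole "w" cluster
result = ''.join(parts)
```

Working on the list of clusters rather than on the raw string means an insertion can only land before the w or after the caron.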



-- 
Steven

-- 
https://mail.python.org/mailman/listinfo/python-list

Re: python 2.7 and unicode (one more time)

2014-11-21 Thread Chris Angelico

On Sat, Nov 22, 2014 at 2:23 AM, Steven D'Aprano
steve+comp.lang.pyt...@pearwood.info wrote:
 Chris Angelico wrote:

 On Fri, Nov 21, 2014 at 11:32 AM, Steven D'Aprano
 steve+comp.lang.pyt...@pearwood.info wrote:
 (E.g. there are millions of existing files across the world containing
 text which use legacy encodings that are not compatible with Unicode.)

 Not compatible with Unicode? There aren't many character sets out
 there that include characters not in Unicode - that was the whole
 point. Of course, there are plenty of files in unspecified eight-bit
 encodings, so you may have a problem with reliable decoding - but if
 you know what the encoding is, you ought to be able to represent each
 character in Unicode.

 What I meant was that for some encodings -- namely ASCII and Latin-1 -- the
 ordinals are exactly equivalent to Unicode, that is:

 That's not quite as significant as I thought, though. What is significant is
 that a pure ASCII file on disk can be read by a program assuming UTF-8:

 although the same is not the case for Latin-1 encoded files.

Yep. Thing is, Unicode can't magically convert all files on all
disks... but with a good codec library, you can at least convert
things as you find them. (I was reading MacRoman files earlier this
year. THAT is an encoding I didn't expect I'd find in 2014.)

 Well, yes. My point, agreeing with Marko, is that any time you want to do
 something even vaguely related to human-readable text, code points are
 not enough. ... What about something like this?

 'w\N{COMBINING CIRCUMFLEX ACCENT}\N{COMBINING OGONEK}\N{COMBINING CARON}'

 If I insert a character into my string, I want to be able to insert before
 the w or after the caron, but not in the middle of those three code points.

Yes, which is a concern. Also a concern is the ability to detect other
boundaries, like words. None of these can be easily solved; all of
them can be dealt with by using the Unicode character data, which is
better than you get for most legacy encodings. In terms of Python
strings, it still makes sense to insert characters between those
combining characters; so what you're saying is that a text editor
widget needs to be aware of more than just code points. Which is
trivially obvious in the presence of RTL text, too; cursor positions
through differing-direction text will be an issue.

The problems you're citing aren't Unicode problems. They stem from the
complexities of human languages. Unicode just makes them a bit more
visible to English-only-speaking programmers.

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: python 2.7 and unicode (one more time)

2014-11-21 Thread Tim Chase

On 2014-11-22 02:23, Steven D'Aprano wrote:
 LATIN SMALL LETTER E
 COMBINING CIRCUMFLEX ACCENT
 
 then my application should treat that as a single character and
 display it as:
 
 LATIN SMALL LETTER E WITH CIRCUMFLEX
 
 which looks like this: ê
 
 rather than two distinct characters eˆ
 
 Now, that specific example is a no-brainer, because the Unicode
 normalization routines will handle the conversion. But not every
 combination of accented characters has a canonical combined form.
 What about something like this?
 
 'w\N{COMBINING CIRCUMFLEX ACCENT}\N{COMBINING OGONEK}\N{COMBINING
 CARON}'
 
 If I insert a character into my string, I want to be able to insert
 before the w or after the caron, but not in the middle of those
 three code points.

Things get even weirder if you have

 '\N{LATIN SMALL LETTER E WITH CIRCUMFLEX}\N{COMBINING OGONEK}\N{COMBINING CARON}'

and when you try to do comparisons like

 s1 = '\N{LATIN SMALL LETTER E WITH CIRCUMFLEX}\N{COMBINING OGONEK}'
 s2 = 'e\N{COMBINING CIRCUMFLEX ACCENT}\N{COMBINING OGONEK}'
 s3 = 'e\N{COMBINING OGONEK}\N{COMBINING CIRCUMFLEX ACCENT}'
 print(s1 == s2)
 print(s1 == s3)
 print(s2 == s3)

Then you also have the case where you want to edit text and the user
wants to remove the COMBINING OGONEK from the character, so you *do*
want to do something akin to

 s4 = ''.join(c for c in s3 if c != '\N{COMBINING OGONEK}')

And yet, weird things happen if you try to remove the circumflex:

  for test in (s1, s2, s3):
      print(test == ''.join(
          c for c in test if c != '\N{COMBINING CIRCUMFLEX ACCENT}'
          ))

They all make sense if you understand what's going on under the hood,
but from a visual/conceptual perspective, something feels amiss.
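
For what it's worth, a short Python 3 sketch of what is going on under the hood with the three strings above: compared code point by code point they all differ, but after canonical normalization (NFD is used here; the tuple names are illustrative) they compare equal.

```python
import unicodedata

s1 = '\N{LATIN SMALL LETTER E WITH CIRCUMFLEX}\N{COMBINING OGONEK}'
s2 = 'e\N{COMBINING CIRCUMFLEX ACCENT}\N{COMBINING OGONEK}'
s3 = 'e\N{COMBINING OGONEK}\N{COMBINING CIRCUMFLEX ACCENT}'

# Raw comparison looks only at the sequence of code points.
raw_equal = (s1 == s2, s1 == s3, s2 == s3)

# NFD decomposes precomposed letters and puts combining marks
# into a canonical order, so equivalent sequences become identical.
n1, n2, n3 = (unicodedata.normalize('NFD', s) for s in (s1, s2, s3))
normalized_equal = (n1 == n2, n1 == n3, n2 == n3)
```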

-tkc




-- 
https://mail.python.org/mailman/listinfo/python-list

Re: python 2.7 and unicode (one more time)

2014-11-21 Thread Rustom Mody

On Friday, November 21, 2014 12:06:54 PM UTC+5:30, Marko Rauhamaa wrote:
 Chris Angelico :
 
  On Fri, Nov 21, 2014 at 5:56 AM, Marko Rauhamaa  wrote:
  I don't really like it how Unicode is equated with text, or even
  character strings.
  [...]
  Do you have actual text that you're unable to represent in Unicode?
 
 Not my point at all.
 
 I'm saying equating an abstract data type (string) with its
 representation (Unicode vector) is bad taste.
 
  We don't call numbers IEEE,
 
 Exactly.
 
  Do you genuinely have text that you can't represent in Unicode, or are
  you just arguing against Unicode to try to justify Python strings are
  something else as a basis for your code?
 
 Nobody is arguing against Unicode. I'm saying, let's talk about the
 forest instead of the trees (except when the trees really are the
 focus).

I've always felt the makers of C showed remarkably good taste in 
the names 'int' and 'float'. Unlike:
Pascal: Int and Real
PL/1: Fixed and Float

IOW, the name 'float' is an explicit reminder of the leakier abstraction used for real numbers.

Likewise in 2014, and given the arguments, inconsistencies, etc
remembering the nuts-n-bolts below the strings-represented-as-unicode
abstraction may be in order.
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: python 2.7 and unicode (one more time)

2014-11-21 Thread Francis Moreau

On 11/20/2014 04:15 PM, Chris Angelico wrote:
 On Fri, Nov 21, 2014 at 1:14 AM, Francis Moreau francis.m...@gmail.com 
 wrote:
 Hi,

 Thanks for the "from __future__ import unicode_literals" trick, it makes
 that switch much less intrusive.

 However it seems that I will suddenly be trapped by all modules which
 are not prepared to handle unicode. For example:

   >>> from __future__ import unicode_literals
   >>> import locale
   >>> locale.setlocale(locale.LC_ALL, 'fr_FR')
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
     File "/usr/lib64/python2.7/locale.py", line 546, in setlocale
       locale = normalize(_build_localename(locale))
     File "/usr/lib64/python2.7/locale.py", line 453, in _build_localename
       language, encoding = localetuple
   ValueError: too many values to unpack

 Is the locale module an exception and in that case I'll fix it by doing:

   locale.setlocale(locale.LC_ALL, b'fr_FR')

 or is a (big) part of the modules in python 2.7 still not ready for
 unicode and in that case I have to decide which prefix (u or b) I should
 manually add ?
 
 Sadly, there are quite a lot of parts of Python 2 that simply don't
 handle Unicode strings. But you can probably keep all of those down to
 just a handful of explicit b"whatever" strings; most places should
 accept unicode as well as str. What you're seeing here is a prime
 example of one of this author's points (caution, long post):
 
 http://unspecified.wordpress.com/2012/04/19/the-importance-of-language-level-abstract-unicode-strings/
 
 The lesson of Python 3 is: give programmers a Unicode string type,
 *make it the default*, and encoding issues will /mostly/ go away.
 
 There's a whole ecosystem to Python 2 - some in the standard library,
 heaps more in the rest of the world - and a lot of it was written on
 the assumption that a byte is a character is an octet. When you pass
 Unicode strings to functions written to expect byte strings, sometimes
 you win, and sometimes you lose... even with the standard library
 itself. But the Python 3 ecosystem has been written on the assumption
 that strings are Unicode. It's only a narrow set of programs
 (boundary code, where you're moving text across networks and stuff
 like that) where the Python 2 model is easier to work with; and the
 recent Py3 releases have been progressively working to relieve that
 pain.
 
 The absolute worst case is a function which exists in Python 2 and 3,
 and requires a byte string in Py2 and a text string in Py3. Sadly,
 that may be exactly what locale.setlocale() is. For that, I would
 suggest explicitly passing stuff through str():
 
 locale.setlocale(locale.LC_ALL, str('fr_FR'))
 
 In Python 3, 'fr_FR' is already a str, so passing it through str()
 will have no significant effect. (Though it would be worth commenting
 that, to make it clear to a subsequent reader that this is Py2 compat
 code.) In Python 2 with unicode_literals active, 'fr_FR' is a unicode,
 so passing it through str() will encode it to ASCII, producing a byte
 string that setlocale should be happy with.
 
 By the way, the reason for the strange error message is clearer in
 Python 3, which chains in another exception:
 
 >>> locale.setlocale(locale.LC_ALL, b'fr_FR')
 Traceback (most recent call last):
   File "/usr/local/lib/python3.5/locale.py", line 498, in _build_localename
     language, encoding = localetuple
 ValueError: too many values to unpack (expected 2)
 
 During handling of the above exception, another exception occurred:
 
 Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
   File "/usr/local/lib/python3.5/locale.py", line 594, in setlocale
     locale = normalize(_build_localename(locale))
   File "/usr/local/lib/python3.5/locale.py", line 507, in _build_localename
     raise TypeError('Locale must be None, a string, or an iterable of
 two strings -- language code, encoding.')
 TypeError: Locale must be None, a string, or an iterable of two
 strings -- language code, encoding.
 
 So when it gets the wrong type of string, it attempts to unpack it as
 an iterable; it yields five values (the five bytes or characters,
 depending on which way it's the wrong type of string), but it's
 expecting two. Fortunately, str() will deal with this. But make sure
 you don't have the b prefix, or str() in Py3 will give you quite a
 different result!
 

Yes, I finally used str(), since only setlocale() turned out to have
issues with unicode_literals active in my application.

Thanks Chris for your useful insight.
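
For the record, a minimal sketch of that str() workaround. It assumes the 'C' locale in place of 'fr_FR' (which may not be installed everywhere), and the variable name is illustrative:

```python
import locale

# On Python 2 with "from __future__ import unicode_literals" active, a bare
# literal is unicode, and str() converts it to the native byte string that
# setlocale() expects. On Python 3, str() on a str changes nothing.
locale_name = str('C')   # the thread used 'fr_FR'; 'C' is always available

result = locale.setlocale(locale.LC_ALL, locale_name)
```

The same line therefore runs unchanged on both versions, which is the point of routing the literal through str().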

-- 
https://mail.python.org/mailman/listinfo/python-list

Re: python 2.7 and unicode (one more time)

2014-11-21 Thread Chris Angelico

On Sat, Nov 22, 2014 at 3:11 AM, Francis Moreau francis.m...@gmail.com wrote:
 Yes I finally used str() since only setlocale() reported to have some
 issues with unicode_literals active in my application.

 Thanks Chris for your useful insight.

My pleasure. Unicode is a bit of a hobby-horse of mine, so I'm always
happy to see people getting things right :)

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: python 2.7 and unicode (one more time)

2014-11-21 Thread Marko Rauhamaa

Rustom Mody rustompm...@gmail.com:

 Likewise in 2014, and given the arguments, inconsistencies, etc
 remembering the nuts-n-bolts below the strings-represented-as-unicode
 abstraction may be in order.

No need to hide Unicode, but talking about a

   Unicode string

is like talking about an

   electronic computer

   visible spectrum display

   mouse user interface

   ethernet socket

   magnetic file

   electric power supply

The language spec calls the things just strings, as it should.


Marko
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: python 2.7 and unicode (one more time)

2014-11-21 Thread Chris Angelico

On Sat, Nov 22, 2014 at 3:36 AM, Marko Rauhamaa ma...@pacujo.net wrote:
 No need to hide Unicode, but talking about a

Unicode string

 is like talking about an

electronic computer

visible spectrum display

mouse user interface

ethernet socket

magnetic file

electric power supply

 The language spec calls the things just strings, as it should.

I'm not sure what you mean here, because the adjectives all cut out
other common constructs - a byte string, an analog computer, an IR or
UV display, a blind-compatible UI, a Unix domain socket, an in-memory
file, and a diesel power supply. Okay, I'm pushing it with the last
one (they're usually called gen sets, not power supplies), and I don't
often hear people talk about magnetic files, but the rest are
definitely valid comparison/contrast terms.

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: python 2.7 and unicode (one more time)

2014-11-21 Thread Steven D'Aprano

Marko Rauhamaa wrote:

 Rustom Mody rustompm...@gmail.com:
 
 Likewise in 2014, and given the arguments, inconsistencies, etc
 remembering the nuts-n-bolts below the strings-represented-as-unicode
 abstraction may be in order.
 
 No need to hide Unicode, but talking about a
 
Unicode string
 
 is like talking about an
 
electronic computer

versus a hydraulic computer, a mechanical computer, an optical computer, a
human computer, a genetic (DNA) computer, ... 

visible spectrum display

I'm not sure that many people actually do refer to visible spectrum
display, or what you mean by it, but I can easily imagine that being in
contrast with a non-visible spectrum display.


mouse user interface

As opposed to a commandline user interface, direct brain-to-computer user
interface, touch UI, etc. Not to mention non-user interfaces, like SCSI
interface, SATA interface, USB interface, ...


ethernet socket

Telephone socket, Appletalk socket, Firewire socket, ADB socket ...


magnetic file

I have no idea what you mean here. Do you mean magnetic *field*? As opposed
to an electric field, gravitational field, Higgs field, strong nuclear
force field, weak nuclear force field ...


electric power supply
 
 The language spec calls the things just strings, as it should.


I really don't understand what bothers you about this. In Python, we have
Unicode strings and byte strings. In computing in general, strings can
consist of Unicode characters, ASCII characters, Tron characters, EBCDIC
characters, ISO-8859-7 characters, and literally dozens of others. It
boggles my mind that you are so opposed to being explicit about what sort
of string we are dealing with.

Are you equally disturbed when people distinguish between tablespoon,
teaspoon, dessert spoon and serving spoon?



-- 
Steven

-- 
https://mail.python.org/mailman/listinfo/python-list

Re: python 2.7 and unicode (one more time)

2014-11-21 Thread Marko Rauhamaa

Steven D'Aprano steve+comp.lang.pyt...@pearwood.info:

 In Python, we have Unicode strings and byte strings.

No, you don't. You have strings and bytes:

  Textual data in Python is handled with str objects, or strings.
  Strings are immutable sequences of Unicode code points. String
  literals are written in a variety of ways: [...]

  URL: https://docs.python.org/3/library/stdtypes.html#text-sequence-type-str

  The core built-in types for manipulating binary data are bytes and bytearray.

  URL: https://docs.python.org/3/library/stdtypes.html#binary-sequence-types-bytes-bytearray-memoryview


Equivalently, I wouldn't mind "character strings" vs "byte strings".
"Unicode strings" is not wrong but the technical emphasis on Unicode is as
strange as a "tire car" or "rectangular door" when "car" and "door" are
what you usually mean.


Marko
-- 
https://mail.python.org/mailman/listinfo/python-list

python 2.7 and unicode (one more time)

2014-11-20 Thread Francis Moreau

Hello,

My application is using the gettext module to do the translation
stuff. Translated messages are unicode on both python 2 and
3 (with python2.7 I had to explicitly ask for unicode).

A problem arises when formatting those messages before logging
them. For example:

  log.debug("%s: %s" % (header, _("will return an unicode string")))

Indeed on python2.7, "%s: %s" is 'str' whereas _() returns
unicode.

My question is: how should this be fixed properly ?

A simple solution would be to force all strings passed to the
logger to be unicode:

  log.debug(u"%s: %s" % ...)

and more generally force all strings in my code to be unicode by
using the 'u' prefix.

or is there a proper solution ?

Thanks.

-- 
https://mail.python.org/mailman/listinfo/python-list

Re: python 2.7 and unicode (one more time)

2014-11-20 Thread Chris Angelico

On Thu, Nov 20, 2014 at 8:40 PM, Francis Moreau francis.m...@gmail.com wrote:
 My question is: how should this be fixed properly ?

 A simple solution would be to force all strings passed to the
 logger to be unicode:

   log.debug(u"%s: %s" % ...)

 and more generally force all string in my code to be unicode by
 using the 'u' prefix.

Yep. And then you may want to consider "from __future__ import
unicode_literals", which will make string literals represent Unicode
strings rather than byte strings. Basically the same as you're saying,
only without the explicit u prefixes.

This will also make your Py2 code behave more like the way your Py3
code does (as bare string literals are always Unicode strings in Py3).
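
A small sketch of the all-Unicode approach with explicit u prefixes, which behaves the same on Python 2.7 and on Python 3.3+. The `_` function here is a hypothetical stand-in for the gettext translation function, not the real one:

```python
def _(message):
    # Hypothetical stand-in for gettext: always returns a Unicode string.
    return u'traduit: ' + message

header = u'startup'

# Format string and arguments are all Unicode, so no implicit
# byte/text coercion is involved when they are combined.
line = u'%s: %s' % (header, _(u'd\xe9marrage termin\xe9'))
```

With unicode_literals active at the top of a Python 2 module, the same code could be written without the u prefixes.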

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: python 2.7 and unicode (one more time)

2014-11-20 Thread Peter Otten

Francis Moreau wrote:

 Hello,
 
 My application is using gettext module to do the translation
 stuff. Translated messages are unicode on both python 2 and
 3 (with python2.7 I had to explicitly ask for unicode).
 
 A problem arises when formatting those messages before logging
 them. For example:
 
   log.debug("%s: %s" % (header, _("will return an unicode string")))

This is only problematic if header is a non-ascii bytestring.

 Indeed on python2.7, "%s: %s" is 'str' whereas _() returns
 unicode.
 
 My question is: how should this be fixed properly ?
 
 A simple solution would be to force all strings passed to the
 logger to be unicode:
 
   log.debug(u"%s: %s" % ...)
 
 and more generally force all string in my code to be unicode by
 using the 'u' prefix.
 
 or is there a proper solution ?

You don't need to change an all-ascii bytestring to unicode. 
Lo and behold:

>>> "%s %s" % (u"üblich", u"ähnlich")
u'\xfcblich \xe4hnlich'
>>> u"%s %s" % (u"üblich", u"ähnlich")
u'\xfcblich \xe4hnlich'

Only non-ascii bytestrings mean trouble, either noisy

>>> u"%s nötig %s" % (u"üblich", "ähnlich")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: 
ordinal not in range(128)
>>> "%s nötig %s" % (u"üblich", u"ähnlich")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 4: 
ordinal not in range(128)

or silently until you have to decipher the logfile contents. It's best to 
stay away from them, and the

from __future__ import unicode_literals

that Chris mentioned is a convenient way to achieve that.
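
The two tracebacks above can be reproduced in spirit on Python 3, as a sketch only: Python 2 mixes str and unicode by implicitly decoding the byte string with the ASCII codec, and the helper below (a made-up name) performs that decode explicitly. It assumes the byte strings were UTF-8 encoded, which the 0xc3 byte in the errors suggests.

```python
def implicit_decode(data):
    """Mimic Python 2's implicit str -> unicode coercion via the ASCII codec."""
    return data.decode('ascii')

failures = []
for data in (u'\xe4hnlich'.encode('utf-8'),        # byte-string argument
             u'%s n\xf6tig %s'.encode('utf-8')):   # byte-string format
    try:
        implicit_decode(data)
    except UnicodeDecodeError as exc:
        # Record where decoding stopped and which byte was rejected.
        failures.append((exc.start, hex(data[exc.start])))

# An all-ASCII byte string passes through the same coercion unharmed.
ascii_result = implicit_decode(b'%s %s')
```

The recorded positions line up with the "position 0" and "position 4" in the error messages: the first is the argument's leading byte, the second is the ö inside the format string.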

-- 
https://mail.python.org/mailman/listinfo/python-list

Re: python 2.7 and unicode (one more time)

2014-11-20 Thread Chris Angelico

On Thu, Nov 20, 2014 at 11:35 PM, Peter Otten __pete...@web.de wrote:
 You don't need to change an all-ascii bytestring to unicode.
 Lo and behold:

 >>> "%s %s" % (u"üblich", u"ähnlich")
 u'\xfcblich \xe4hnlich'
 >>> u"%s %s" % (u"üblich", u"ähnlich")
 u'\xfcblich \xe4hnlich'

 Only non-ascii bytestrings mean trouble, either noisy


It's better to not depend on that, though. Be clear and explicit about
the difference between bytes and text, and don't try to pretend
they're the same thing, even for ASCII.

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: python 2.7 and unicode (one more time)

2014-11-20 Thread random832

On Thu, Nov 20, 2014, at 07:35, Peter Otten wrote:
  >>> "%s nötig %s" % (u"üblich", u"ähnlich")
 Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
 UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 4: 
 ordinal not in range(128)

This is surprising to me - why is it trying to decode the format string,
rather than encode the arguments?
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: python 2.7 and unicode (one more time)

2014-11-20 Thread Francis Moreau

Hi,

On 11/20/2014 11:47 AM, Chris Angelico wrote:
 On Thu, Nov 20, 2014 at 8:40 PM, Francis Moreau francis.m...@gmail.com 
 wrote:
 My question is: how should this be fixed properly ?

 A simple solution would be to force all strings passed to the
 logger to be unicode:

   log.debug(u"%s: %s" % ...)

 and more generally force all string in my code to be unicode by
 using the 'u' prefix.
 
 Yep. And then you may want to consider from __future__ import
 unicode_literals, which will make string literals represent Unicode
 strings rather than byte strings. Basically the same as you're saying,
 only without the explicit u prefixes.

Thanks for the from __future__ import unicode_literals trick, it makes
that switch much less intrusive.

However it seems that I will suddenly be trapped by all modules which
are not prepared to handle unicode. For example:

 >>> from __future__ import unicode_literals
 >>> import locale
 >>> locale.setlocale(locale.LC_ALL, 'fr_FR')
 Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
   File "/usr/lib64/python2.7/locale.py", line 546, in setlocale
     locale = normalize(_build_localename(locale))
   File "/usr/lib64/python2.7/locale.py", line 453, in _build_localename
     language, encoding = localetuple
 ValueError: too many values to unpack

Is the locale module an exception and in that case I'll fix it by doing:

  locale.setlocale(locale.LC_ALL, b'fr_FR')

or is a (big) part of the modules in python 2.7 still not ready for
unicode and in that case I have to decide which prefix (u or b) I should
manually add ?

Thanks.
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: python 2.7 and unicode (one more time)

2014-11-20 Thread Chris Angelico

On Fri, Nov 21, 2014 at 12:59 AM,  random...@fastmail.us wrote:
 On Thu, Nov 20, 2014, at 07:35, Peter Otten wrote:
  >>> "%s nötig %s" % (u"üblich", u"ähnlich")
 Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
 UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 4:
 ordinal not in range(128)

 This is surprising to me - why is it trying to decode the format string,
 rather than encode the arguments?

Why should it encode to bytes? Makes much better sense to work in
Unicode. But mainly, it has to do one of them, and be predictable. If
you add a float and an int, you have to predictably get back one of
those two types, and since neither is a perfect superset of the other,
one just has to be picked. (And that's float, because it's more likely
to be the better option.) In this case, picking Unicode to meet on is
easily the better option, because you'll often have pure-ASCII string
literals as format strings, and Unicode data being interpolated into
it. So using an ASCII codec is far more likely to succeed if you
decode the format string than if you encode the data.
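
To make that asymmetry concrete, here is a small sketch (the variable names
are only illustrative) that performs both conversions explicitly with the
ASCII codec, which is what Python 2 would otherwise pick implicitly:

```python
# Sketch only: the variable names are illustrative.
format_string = b"%s: %s"        # a typical pure-ASCII format string
data = u"\xfcblich"              # Unicode data with a non-ASCII character

# Decoding the format string with the ASCII codec succeeds...
decoded_format = format_string.decode("ascii")

# ...whereas encoding the data with the ASCII codec does not.
try:
    data.encode("ascii")
    encode_succeeded = True
except UnicodeEncodeError:
    encode_succeeded = False

result = decoded_format % (u"header", data)
```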

Personally, I'd much rather be very clear about what's text and what's
bytes, and not have any automatic encoding at all. That's why I use
Python 3.

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: python 2.7 and unicode (one more time)

2014-11-20 Thread Chris Angelico

On Fri, Nov 21, 2014 at 1:14 AM, Francis Moreau francis.m...@gmail.com wrote:
 Hi,

 Thanks for the from __future__ import unicode_literals trick, it makes
 that switch much less intrusive.

 However it seems that I will suddenly be trapped by all modules which
 are not prepared to handle unicode. For example:

   >>> from __future__ import unicode_literals
   >>> import locale
   >>> locale.setlocale(locale.LC_ALL, 'fr_FR')
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
    File "/usr/lib64/python2.7/locale.py", line 546, in setlocale
      locale = normalize(_build_localename(locale))
    File "/usr/lib64/python2.7/locale.py", line 453, in _build_localename
      language, encoding = localetuple
  ValueError: too many values to unpack

 Is the locale module an exception and in that case I'll fix it by doing:

   locale.setlocale(locale.LC_ALL, b'fr_FR')

 or is a (big) part of the modules in python 2.7 still not ready for
 unicode and in that case I have to decide which prefix (u or b) I should
 manually add ?

Sadly, there are quite a lot of parts of Python 2 that simply don't
handle Unicode strings. But you can probably keep all of those down to
just a handful of explicit b"whatever" strings; most places should
accept unicode as well as str. What you're seeing here is a prime
example of one of this author's points (caution, long post):

http://unspecified.wordpress.com/2012/04/19/the-importance-of-language-level-abstract-unicode-strings/

The lesson of Python 3 is: give programmers a Unicode string type,
*make it the default*, and encoding issues will /mostly/ go away.

There's a whole ecosystem to Python 2 - some in the standard library,
heaps more in the rest of the world - and a lot of it was written on
the assumption that a byte is a character is an octet. When you pass
Unicode strings to functions written to expect byte strings, sometimes
you win, and sometimes you lose... even with the standard library
itself. But the Python 3 ecosystem has been written on the assumption
that strings are Unicode. It's only a narrow set of programs
(boundary code, where you're moving text across networks and stuff
like that) where the Python 2 model is easier to work with; and the
recent Py3 releases have been progressively working to relieve that
pain.

The absolute worst case is a function which exists in Python 2 and 3,
and requires a byte string in Py2 and a text string in Py3. Sadly,
that may be exactly what locale.setlocale() is. For that, I would
suggest explicitly passing stuff through str():

locale.setlocale(locale.LC_ALL, str('fr_FR'))

In Python 3, 'fr_FR' is already a str, so passing it through str()
will have no significant effect. (Though it would be worth commenting
that, to make it clear to a subsequent reader that this is Py2 compat
code.) In Python 2 with unicode_literals active, 'fr_FR' is a unicode,
so passing it through str() will encode it to ASCII, producing a byte
string that setlocale should be happy with.
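
As a rough sketch of that idiom (using the "C" locale purely as an
assumption, because it exists everywhere while 'fr_FR' may not be installed
on the machine running this):

```python
import locale

# Sketch only: "C" replaces 'fr_FR' so that the call cannot fail for lack
# of an installed locale.
# On Python 3 str() leaves the literal unchanged; on Python 2 with
# unicode_literals it encodes the unicode literal to an ASCII byte string.
locale_name = str("C")

current = locale.setlocale(locale.LC_ALL, locale_name)
```

Either way the argument ends up being the native str type of the running
interpreter, which is what setlocale() wants.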

By the way, the reason for the strange error message is clearer in
Python 3, which chains in another exception:

>>> locale.setlocale(locale.LC_ALL, b'fr_FR')
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/locale.py", line 498, in _build_localename
    language, encoding = localetuple
ValueError: too many values to unpack (expected 2)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.5/locale.py", line 594, in setlocale
    locale = normalize(_build_localename(locale))
  File "/usr/local/lib/python3.5/locale.py", line 507, in _build_localename
    raise TypeError('Locale must be None, a string, or an iterable of
two strings -- language code, encoding.')
TypeError: Locale must be None, a string, or an iterable of two
strings -- language code, encoding.

So when it gets the wrong type of string, it attempts to unpack it as
an iterable; it yields five values (the five bytes or characters,
depending on which way it's the wrong type of string), but it's
expecting two. Fortunately, str() will deal with this. But make sure
you don't have the b prefix, or str() in Py3 will give you quite a
different result!
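
Both effects can be reproduced without touching the locale module at all.
This is only a sketch, and the results noted in the comments are for
Python 3:

```python
# Sketch only; the behaviour described in the comments is for Python 3.
name = b"fr_FR"

# Iterating over the wrong kind of string yields five items, not two,
# so tuple unpacking fails just as it does inside locale.py.
try:
    language, encoding = name
    unpack_error = None
except ValueError as exc:
    unpack_error = str(exc)

# str() on a bytes object gives its repr on Python 3, not the text.
as_text = str(name)              # "b'fr_FR'" on Python 3, 'fr_FR' on Python 2

# Explicit decoding gives the text on both versions.
decoded = name.decode("ascii")
```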

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: python 2.7 and unicode (one more time)

2014-11-20 Thread Peter Otten

random...@fastmail.us wrote:

 On Thu, Nov 20, 2014, at 07:35, Peter Otten wrote:
  >>> "%s nötig %s" % (u"üblich", u"ähnlich")
 Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
 UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 4:
 ordinal not in range(128)
 
 This is surprising to me - why is it trying to decode the format string,
 rather than encode the arguments?

Probably to make it easier to mix byte and unicode strings. In hindsight it 
may not have been a good idea, but it had the potential to save some memory.

I think that you may get a Unicode/Encode/Error when you try to /decode/ a 
unicode string is more confusing...


-- 
https://mail.python.org/mailman/listinfo/python-list

Re: python 2.7 and unicode (one more time)

2014-11-20 Thread Chris Angelico

On Fri, Nov 21, 2014 at 2:40 AM, Peter Otten __pete...@web.de wrote:
 I think that you may get a Unicode/Encode/Error when you try to /decode/ a
 unicode string is more confusing...

Hang on a minute, what does it even mean to decode a Unicode string?
That's where the problem is. Fortunately that's one that Py3 solved -
str simply doesn't have a decode() method.

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: python 2.7 and unicode (one more time)

2014-11-20 Thread Peter Otten

Chris Angelico wrote:

 On Fri, Nov 21, 2014 at 2:40 AM, Peter Otten __pete...@web.de wrote:
 I think that you may get a Unicode/Encode/Error when you try to /decode/
 a unicode string is more confusing...
 
 Hang on a minute, what does it even mean to decode a Unicode string?

Let's not get philosophical ;)

 That's where the problem is. Fortunately that's one that Py3 solved -
 str simply doesn't have a decode() method.



-- 
https://mail.python.org/mailman/listinfo/python-list

Re: python 2.7 and unicode (one more time)

2014-11-20 Thread Chris Angelico

On Fri, Nov 21, 2014 at 3:32 AM, Peter Otten __pete...@web.de wrote:
 Chris Angelico wrote:

 On Fri, Nov 21, 2014 at 2:40 AM, Peter Otten __pete...@web.de wrote:
 I think that you may get a Unicode/Encode/Error when you try to /decode/
 a unicode string is more confusing...

 Hang on a minute, what does it even mean to decode a Unicode string?

 Let's not get philosophical ;)

No, I'm quite serious. You encode Unicode text into bytes; you decode
bytes into text. You can also encode a floating-point value into
bytes, and decode bytes into a float. Or you could encode a large and
complex structure into bytes, using something like pickle or json, and
then decode those bytes later on. The pattern is always the same: the
abstract object with meaning to a human is encoded into a concrete
form that a computer can handle, and the concrete is decoded into the
abstract. If you're not good at sight-reading sheet music, you'll have
the same feeling of staring at the dots, decoding them one by one into
this abstract thing called music, and then being able to work with
it.
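
The same encode/decode pattern, sketched with the standard struct and json
modules (the values are made up for illustration):

```python
import json
import struct

# Sketch only: the values are made up for illustration.
number = 1.5
encoded_number = struct.pack("<d", number)           # float -> 8 bytes
(decoded_number,) = struct.unpack("<d", encoded_number)

structure = {"name": u"n\xf6tig", "sizes": [1, 2, 3]}
encoded_structure = json.dumps(structure).encode("utf-8")    # object -> bytes
decoded_structure = json.loads(encoded_structure.decode("utf-8"))
```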

When you try to decode a Unicode string, what happens is that Python 2
says "Oh, you're trying to do a byte-string operation on a Unicode
string... I'll quickly encode that to bytes for you, then do what you
asked." That's why you can get an *en*coding error when you asked to
*de*code - because both operations have to happen.
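
Spelled out as code, the two steps look roughly like this. It is a sketch:
py2_style_decode() is a hypothetical stand-in for what Python 2 does behind
the scenes, with ASCII assumed as the implicit codec.

```python
# Sketch only: py2_style_decode() is a hypothetical stand-in for what
# Python 2 does when .decode() is called on a unicode object.

def py2_style_decode(text, encoding):
    as_bytes = text.encode("ascii")    # the implicit step; may fail
    return as_bytes.decode(encoding)   # the step that was actually requested

try:
    py2_style_decode(u"\xfcblich", "utf-8")
    error_name = None
except UnicodeEncodeError as exc:
    error_name = type(exc).__name__
```

For pure-ASCII text the first step succeeds silently, which is why the
problem tends to show up only once real data arrives.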

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: python 2.7 and unicode (one more time)

2014-11-20 Thread Michael Torrie

On 11/20/2014 09:32 AM, Peter Otten wrote:
 Chris Angelico wrote:
 
 On Fri, Nov 21, 2014 at 2:40 AM, Peter Otten __pete...@web.de wrote:
 I think that you may get a Unicode/Encode/Error when you try to /decode/
 a unicode string is more confusing...

 Hang on a minute, what does it even mean to decode a Unicode string?
 
 Let's not get philosophical ;)

It's not philosophical.  It's an important distinction that folks need
to be clear on when understanding unicode and the errors that python can
throw.

Unicode can only be encoded to bytes.
Bytes can only be decoded to unicode.

Without understanding that, the exception errors about decoding won't be
properly understood, nor will one know how to fix them.
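
A short sketch of the two directions, with an arbitrary sample word and
arbitrarily chosen encodings:

```python
# Sketch only: the sample text and encodings are arbitrary.
text = u"n\xf6tig"                      # text: a sequence of code points

utf8_bytes = text.encode("utf-8")       # text -> bytes
latin1_bytes = text.encode("latin-1")   # same text, different bytes

# Decoding needs the encoding that was used for encoding.
round_trip = utf8_bytes.decode("utf-8")

try:
    latin1_bytes.decode("utf-8")
    wrong_codec_failed = False
except UnicodeDecodeError:
    wrong_codec_failed = True
```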
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: python 2.7 and unicode (one more time)

2014-11-20 Thread Peter Otten

Chris Angelico wrote:

 On Fri, Nov 21, 2014 at 3:32 AM, Peter Otten __pete...@web.de wrote:
 Chris Angelico wrote:

 On Fri, Nov 21, 2014 at 2:40 AM, Peter Otten __pete...@web.de wrote:
 I think that you may get a Unicode/Encode/Error when you try to
 /decode/ a unicode string is more confusing...

 Hang on a minute, what does it even mean to decode a Unicode string?

 Let's not get philosophical ;)
 
 No, I'm quite serious. 

I'm sorry I'm limited to text, otherwise I would have formatted the

;) as 30pt blinking magenta...

 You encode Unicode text into bytes; you decode
 bytes into text. You can also encode a floating-point value into
 bytes, and decode bytes into a float. Or you could encode a large and
 complex structure into bytes, using something like pickle or json, and
 then decode those bytes later on. The pattern is always the same: the
 abstract object with meaning to a human is encoded into a concrete
 form that a computer can handle, and the concrete is decoded into the
 abstract. If you're not good at sight-reading sheet music, you'll have
 the same feeling of staring at the dots, decoding them one by one into
 this abstract thing called music, and then being able to work with
 it.
 
 When you try to decode a Unicode string, what happens is that Python 2
 says "Oh, you're trying to do a byte-string operation on a Unicode
 string... I'll quickly encode that to bytes for you, then do what you
 asked." That's why you can get an *en*coding error when you asked to
 *de*code - because both operations have to happen.

In an alternative universe unicode.decode() could have been implemented as a 
no-op. 

As you put it, it looks like you have to find the true nature of the problem 
and then cast it into code -- a kind of essentialism. I would rather 
emphasise the process; the evolving interface changes your view on the 
underlying problem -- a hermeneutic cycle if you will.

-- 
https://mail.python.org/mailman/listinfo/python-list

Re: python 2.7 and unicode (one more time)

2014-11-20 Thread random832

On Thu, Nov 20, 2014, at 09:59, Chris Angelico wrote:
 On Fri, Nov 21, 2014 at 12:59 AM,  random...@fastmail.us wrote:
  On Thu, Nov 20, 2014, at 07:35, Peter Otten wrote:
   >>> "%s nötig %s" % (u"üblich", u"ähnlich")
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
  UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 4:
  ordinal not in range(128)
 
  This is surprising to me - why is it trying to decode the format string,
  rather than encode the arguments?
 
 Why should it encode to bytes?

Because a bytes format string suggests a bytes result. Why does unicode
always win, rather than the type of the format string always winning?

 Makes much better sense to work in
 Unicode. But mainly, it has to do one of them, and be predictable.

Yeah, but string % is not a symmetrical operator. People's mental model
of it is likely to be that it acts like format (which does use the type
of the format string) or C sprintf/wsprintf (both of which use the same
type for the format string and result). And literally every other type
is converted to the type of the format string when used with %s - having
unicode be special adds cognitive load, and it means you can't safely
blindly use %s with an unknown object.
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: python 2.7 and unicode (one more time)

2014-11-20 Thread Ian Kelly

On Thu, Nov 20, 2014 at 10:42 AM,  random...@fastmail.us wrote:
 and it means you can't safely
 blindly use %s with an unknown object.

You can't safely do this anyway. Whether it's %s with a str and a
unicode, or %s with a unicode and a str, *something* is going to have
to be implicitly encoded or decoded, and if ascii doesn't happen to be
the correct encoding then the result will be either an error or a
silent failure.
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: python 2.7 and unicode (one more time)

2014-11-20 Thread Ian Kelly

On Thu, Nov 20, 2014 at 11:06 AM, Ian Kelly ian.g.ke...@gmail.com wrote:
 On Thu, Nov 20, 2014 at 10:42 AM,  random...@fastmail.us wrote:
 and it means you can't safely
 blindly use %s with an unknown object.

 You can't safely do this anyway. Whether it's %s with a str and a
 unicode, or %s with a unicode and a str, *something* is going to have
 to be implicitly encoded or decoded, and if ascii doesn't happen to be
 the correct encoding then the result will be either an error or a
 silent failure.

Also note that if you use %r instead of %s, you'll get the result you
want (although the unicode string will be quoted rather than encoded).
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: python 2.7 and unicode (one more time)

2014-11-20 Thread Peter Otten

random...@fastmail.us wrote:

 On Thu, Nov 20, 2014, at 09:59, Chris Angelico wrote:
 On Fri, Nov 21, 2014 at 12:59 AM,  random...@fastmail.us wrote:
  On Thu, Nov 20, 2014, at 07:35, Peter Otten wrote:
   >>> "%s nötig %s" % (u"üblich", u"ähnlich")
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
  UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position
  4: ordinal not in range(128)
 
  This is surprising to me - why is it trying to decode the format
  string, rather than encode the arguments?
 
 Why should it encode to bytes?
 
 Because a bytes format string suggests a bytes result. Why does unicode
 always win, rather than the type of the format string always winning?

My guess is that when unicode was introduced the decision to propagate str 
to unicode in some cases was made because the developers expected that more 
old code that was unaware of unicode would continue to work. 

The old methods __mod__(), replace(), and join() that conceptually deal with 
strings propagate while those that deal with characters -- center(), 
r/ljust(), translate() -- don't.

The newer format() method doesn't propagate which is probably due to a 
change in attitude rather than an oversight.


-- 
https://mail.python.org/mailman/listinfo/python-list

Re: python 2.7 and unicode (one more time)

2014-11-20 Thread Marko Rauhamaa

Michael Torrie torr...@gmail.com:

 Unicode can only be encoded to bytes.
 Bytes can only be decoded to unicode.

I don't really like it how Unicode is equated with text, or even
character strings.

There's barely any difference between the truth value of these
statements:

   Python strings are ASCII.

   Python strings are Latin-1.

   Python strings are Unicode.

Each of those statements is true as long as you stay within the
respective character sets, and cease to be true when your text contains
characters outside the character sets.

Now, it is true that Python currently limits itself to the 1,114,112
Unicode code points. And it likely won't adopt more characters unless
Unicode does it first. However, text is something more lofty and
abstract than a sequence of Unicode code points.

We shouldn't call strings Unicode any more than we call numbers IEEE or
times ISO.


Marko
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: python 2.7 and unicode (one more time)

2014-11-20 Thread Mark Lawrence


On 20/11/2014 18:06, Ian Kelly wrote:

On Thu, Nov 20, 2014 at 10:42 AM,  random...@fastmail.us wrote:

and it means you can't safely
blindly use %s with an unknown object.


You can't safely do this anyway. Whether it's %s with a str and a
unicode, or %s with a unicode and a str, *something* is going to have
to be implicitly encoded or decoded, and if ascii doesn't happen to be
the correct encoding then the result will be either an error or a
silent failure.



All I know about this encoding/decoding malarky is that I'd prefer an 
error to a silent failure any day of the week.


--
My fellow Pythonistas, ask not what our language can do for you, ask
what you can do for our language.

Mark Lawrence

--
https://mail.python.org/mailman/listinfo/python-list

Re: python 2.7 and unicode (one more time)

2014-11-20 Thread Ethan Furman

On 11/20/2014 07:53 AM, Chris Angelico wrote:
 On Fri, Nov 21, 2014 at 2:40 AM, Peter Otten __pete...@web.de wrote:
 I think that you may get a Unicode/Encode/Error when you try to /decode/ a
 unicode string is more confusing...
 
 Hang on a minute, what does it even mean to decode a Unicode string?
 That's where the problem is. Fortunately that's one that Py3 solved -
 str simply doesn't have a decode() method.

If your unicode string happens to contain a base64 encoded .png, then you could 
decode that into bytes.  ;)
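
Taken literally, that would look something like the sketch below, where the
8-byte PNG signature stands in for a real image:

```python
import base64

# Sketch only: just the 8-byte PNG signature stands in for a real image.
png_signature = b"\x89PNG\r\n\x1a\n"

# Base64 turns arbitrary bytes into ASCII-only text...
as_text = base64.b64encode(png_signature).decode("ascii")

# ...and decoding that text gives the original bytes back.
as_bytes = base64.b64decode(as_text.encode("ascii"))
```

Note that this is a different kind of decoding from str/bytes character
decoding: base64 maps bytes to bytes, and the text form is just ASCII.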

--
~Ethan~



-- 
https://mail.python.org/mailman/listinfo/python-list

Re: python 2.7 and unicode (one more time)

2014-11-20 Thread Marko Rauhamaa

Ethan Furman et...@stoneleaf.us:

 If your unicode string happens to contain a base64 encoded .png, then
 you could decode that into bytes. ;)

You could embed your PNG file in XML in binary form as CDATA. Then, your
characters would represent 8- or 16-bit integers. You just need to
replace all accidental occurrences of 

   ]]

with

   ![CDATA[


Marko
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: python 2.7 and unicode (one more time)

2014-11-20 Thread random832



On Thu, Nov 20, 2014, at 16:29, Ethan Furman wrote:
 If your unicode string happens to contain a base64 encoded .png, then you
 could decode that into bytes.  ;)

Bytes of the PNG, or of the raw pixels?
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: python 2.7 and unicode (one more time)

2014-11-20 Thread Chris Angelico

On Fri, Nov 21, 2014 at 4:42 AM,  random...@fastmail.us wrote:
 On Thu, Nov 20, 2014, at 09:59, Chris Angelico wrote:

 Why should it encode to bytes?

 Because a bytes format string suggests a bytes result. Why does unicode
 always win, rather than the type of the format string always winning?

For the same reason that float always wins:

>>> 1.0 + 2
3.0
>>> 1 + 2.0
3.0

 Makes much better sense to work in
 Unicode. But mainly, it has to do one of them, and be predictable.

 Yeah, but string % is not a symmetrical operator. People's mental model
 of it is likely to be that it acts like format (which does use the type
 of the format string) or C sprintf/wsprintf (both of which use the same
 type for the format string and result). And literally every other type
 is converted to the type of the format string when used with %s - having
 unicode be special adds cognitive load, and it means you can't safely
 blindly use %s with an unknown object.

True, but Python 2 deliberately lets you conflate the two, so you get
a bit of convenience at the expense of complexity when things go
wrong. Python 3, on the other hand, is much more careful about the
difference:

>>> "asdf %s qwer" % b"zxcv"
"asdf b'zxcv' qwer"
>>> b"asdf %s qwer" % "zxcv"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unsupported operand type(s) for %: 'bytes' and 'str'

So your complaint *has* been resolved... but only in Python 3, because
the change would break stuff.

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: python 2.7 and unicode (one more time)

2014-11-20 Thread Chris Angelico

On Fri, Nov 21, 2014 at 5:56 AM, Marko Rauhamaa ma...@pacujo.net wrote:
 Michael Torrie torr...@gmail.com:

 Unicode can only be encoded to bytes.
 Bytes can only be decoded to unicode.

 I don't really like it how Unicode is equated with text, or even
 character strings.

 There's barely any difference between the truth value of these
 statements:

Python strings are ASCII.

Python strings are Latin-1.

Python strings are Unicode.

 Each of those statements is true as long as you stay within the
 respective character sets, and cease to be true when your text contains
 characters outside the character sets.

The difference is that ASCII and Latin-1 cut out a large number of
active world languages, UCS-2 (the intermediate option you didn't
mention) cuts out a small proportion (by usage) of significant
characters, and Unicode cuts out only those characters which fall
under issues like Han unification. (Plus any that haven't yet been
allocated. But since Python doesn't actually validate code points to
ensure that they've been given meanings, you can use today's Python to
work with tomorrow's Unicode.)

Do you have actual text that you're unable to represent in Unicode? If
so, you are going to have major problems using it with *any* computer
system. There are Japanese encodings that can represent additional
characters, but they also *cannot* represent a lot of the other
characters we use, so there'll be fundamental incompatibilities.

 Now, it is true that Python currently limits itself to the 1,114,112
 Unicode code points. And it likely won't adopt more characters unless
 Unicode does it first. However, text is something more lofty and
 abstract than a sequence of Unicode code points.

 We shouldn't call strings Unicode any more than we call numbers IEEE or
 times ISO.

We don't call numbers IEEE, but if we're working with Python floats,
we *do* require all numbers to be representable as IEEE
floating-point. Don't like that? Pick decimal.Decimal instead, or
fractions.Fraction, and pick a different set of limitations... but
ultimately, you *will* have restrictions - and much tighter
restrictions than Unicode places on text.
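
A quick sketch of those trade-offs, relying on the default decimal context
(28 significant digits):

```python
from decimal import Decimal
from fractions import Fraction

# Binary floats cannot represent 0.1 exactly, so this comparison is False.
float_exact = (0.1 + 0.2 == 0.3)

# Decimal and Fraction represent these particular values exactly...
decimal_exact = (Decimal("0.1") + Decimal("0.2") == Decimal("0.3"))
fraction_exact = (Fraction(1, 10) + Fraction(2, 10) == Fraction(3, 10))

# ...but Decimal still has finite precision: 1/3 gets rounded, so
# multiplying back by 3 does not give exactly 1.
third_exact = (Decimal(1) / Decimal(3) * 3 == Decimal(1))
```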

Do you genuinely have text that you can't represent in Unicode, or are
you just arguing against Unicode to try to justify "Python strings are
something else" as a basis for your code?

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list

Re: python 2.7 and unicode (one more time)

2014-11-20 Thread Steven D'Aprano

Marko Rauhamaa wrote:

 Michael Torrie torr...@gmail.com:
 
 Unicode can only be encoded to bytes.
 Bytes can only be decoded to unicode.
 
 I don't really like it how Unicode is equated with text, or even
 character strings.

That surely depends on the context. To be technically correct, Unicode is a
character set together with a set of rules for dealing with them (e.g.
rules for uppercasing characters, sorting rules, etc.). When referring to
the standard, "Unicode" is a noun; when referring to text, it is actually
an adjective being used as a noun. That is, "Unicode text" has become
abbreviated as just "Unicode" in much the same way as "human beings" has
become abbreviated as just "humans".

In that sense, "text is Unicode" just means "in the context in which we are
talking, when I say 'text' I mean 'Unicode text' as opposed to (for
example) 'ASCII text' or 'KOI-8 text'". It certainly doesn't mean that
*all* text in other contexts is Unicode, since that is obviously untrue.

(E.g. there are millions of existing files across the world containing text
which use legacy encodings that are not compatible with Unicode.)


 There's barely any difference between the truth value of these
 statements:
 
Python strings are ASCII.
 
Python strings are Latin-1.
 
Python strings are Unicode.
 
 Each of those statements is true as long as you stay within the
 respective character sets, and cease to be true when your text contains
 characters outside the character sets.

When we say "Python strings are FOO", we are making a statement about
arbitrary Python strings, not a particular set of concrete examples of
strings. If Python strings are FOO, that means that for all possible Python
strings s, "s is FOO" is a true statement.

We cannot say that Python strings are uppercase, because we can easily find
counter-examples such as 'xyz'. Likewise we cannot say Python strings are
ASCII, or Latin-1, because we can easily find counter-examples such as 'Ř'.

On the other hand, Python strings *are* Unicode, because by design Python
strings are limited to Unicode. Every Python string is a Unicode string.


 Now, it is true that Python currently limits itself to the 1,114,112
 Unicode code points. And it likely won't adopt more characters unless
 Unicode does it first. However, text is something more lofty and
 abstract than a sequence of Unicode code points.

You are certainly correct that in its full generality, text is much more
than just a string of code points. Unicode strings are a primitive data
type. A powerful and sophisticated text processing application may even
find Python strings too primitive, possibly needing something like ropes of
graphemes rather than strings of code points.

We Western and Northern European speakers -- and I don't know whether Finns
are counted as Northern Europeans or Eastern Europeans -- are lucky in that
our natural languages are well-covered by Unicode. All our graphemes are
also code points, even the funny ones with accents. As an English
speaker, I have to remind myself that not every grapheme is a single code
point, but Devanagari or Navajo writers will never make that mistake.
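
A small sketch of the difference, using the standard unicodedata module. The
sample strings are chosen for illustration, and "one unit" here means what a
reader perceives as a single character:

```python
import unicodedata

# Sketch only: the sample strings are chosen for illustration.
e_acute = u"e\u0301"          # 'e' followed by a combining acute accent
ki = u"\u0915\u093f"          # Devanagari KA followed by vowel sign I

# NFC composes e + accent into the single code point U+00E9...
composed_e = unicodedata.normalize("NFC", e_acute)

# ...but there is no precomposed code point for KA + vowel sign I,
# so what a reader sees as one unit remains two code points.
composed_ki = unicodedata.normalize("NFC", ki)
```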


 We shouldn't call strings Unicode any more than we call numbers IEEE or
 times ISO.

We certainly shouldn't call numbers IEEE, but we might very well call them
IEEE-754. Actually, since IEEE-754 covers multiple formats, we have to be
more specific:

Python floats are IEEE-754 double-precision binary floats.



-- 
Steven

-- 
https://mail.python.org/mailman/listinfo/python-list

Re: python 2.7 and unicode (one more time)

2014-11-20 Thread Chris Angelico

On Fri, Nov 21, 2014 at 11:32 AM, Steven D'Aprano
steve+comp.lang.pyt...@pearwood.info wrote:
 (E.g. there are millions of existing files across the world containing text
 which use legacy encodings that are not compatible with Unicode.)

Not compatible with Unicode? There aren't many character sets out
there that include characters not in Unicode - that was the whole
point. Of course, there are plenty of files in unspecified eight-bit
encodings, so you may have a problem with reliable decoding - but if
you know what the encoding is, you ought to be able to represent each
character in Unicode.
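
For instance (a sketch only, with KOI8-R and the sample word picked
arbitrarily):

```python
# Sketch only: KOI8-R and the sample word are arbitrary choices.
text = u"\u041e\u0442\u043f\u0443\u0441\u0442\u0438"   # "Отпусти"

legacy_bytes = text.encode("koi8_r")     # one byte per character

# Knowing the encoding, every character maps back to Unicode.
recovered = legacy_bytes.decode("koi8_r")

# Guessing the wrong 8-bit encoding does not raise; it silently
# produces different characters.
garbled = legacy_bytes.decode("latin-1")
```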

Not compatible with any of the UTFs, that's different. Plenty of that
in the world.

 You are certainly correct that in its full generality, text is much more
 than just a string of code points. Unicode strings are a primitive data
 type. A powerful and sophisticated text processing application may even
 find Python strings too primitive, possibly needing something like ropes of
 graphemes rather than strings of code points.

That's probably more an efficiency point, though. It should be
possible to do a perfect two-way translation between your grapheme
rope and a Python string; otherwise, you'll have great difficulty
saving your file to the disk (which will normally involve representing
the text in Unicode, then encoding that to bytes).

To be sure, a Python string is a poor representational form for a text
editor. But that's largely because it's immutable, so every little
edit would involve massive copying. Depending on what you're doing, it
might be worth using a chunked UTF-8 byte stream (allowing for
insertion at any chunk boundary), or an array of lines, or something
grapheme-based... but all of those questions are performance, not
correctness, issues.

 We Western and Northern European speakers -- and I don't know whether Finns
 are counted as Northern Europeans or Eastern Europeans -- are lucky in that
 our natural languages are well-covered by Unicode. All our graphemes are
 also code points, even the funny ones with accents. As an English
 speaker, I have to remind myself that not every grapheme is a single code
 point, but Devanagari or Navajo writers will never make that mistake.

I've been working with different languages a bit, lately. Broadly
speaking, you have:

1) Languages which use the Roman alphabet, plus a handful of other
characters (eg Finnish, German). These can be represented largely in
ASCII, and used to be handled fairly easily with a single codepage -
an eight-bit ASCII-compatible encoding.

2) Languages which use a different alphabet (eg Cyrillic - Russian,
Bulgarian). You could possibly cram them into an eight-bit encoding
without tipping ASCII out, but I'm not sure. In Unicode, these
languages are all easily supported by the BMP, as they don't use a
huge number of characters each.

3) Languages which use a non-alphabetic system (eg Korean). I think
they're all still covered by the BMP, but there's no way you can fit
them into eight-bit encodings - one single language will use more than
256 symbols.

4) Ancient, esoteric, or symbolic writing systems. Not fundamentally
different from the above categories except that they're less used, and
the BMP has finite space. These will definitely need the SMP.
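
A quick way to poke at these categories, in Python 3 syntax. The helper names `plane` and `fits_codec` and the sample characters are assumptions chosen for this sketch:

```python
def plane(ch):
    """Unicode plane of a single character: 0 is the BMP, 1 is the SMP."""
    return ord(ch) >> 16

def fits_codec(text, codec):
    """True if every character in text can be encoded with the given codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False

samples = {
    "Finnish": "ä",
    "Russian": "ж",
    "Korean": "한",
    "Egyptian hieroglyph": "\U00013000",
}

for label, ch in samples.items():
    print(label, "U+%04X" % ord(ch), "plane", plane(ch))

print(fits_codec("ä", "latin-1"), fits_codec("ж", "koi8_r"), fits_codec("한", "latin-1"))
```

The first three samples sit in plane 0, the hieroglyph in plane 1, and the Korean syllable is rejected by the eight-bit codec tried here.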

But all of them are covered by Unicode. (Sadly, they are NOT all
covered by all fonts, so I've been finding that certain pieces of text
come out as strings of little boxes. But I can at least manipulate the
text, even if I can't read it back.) I can, for example, zip lines of
text like this:

English:
Let it go, let it go!
I am one with the wind and sky
Let it go, let it go!
You'll never see me cry!

Icelandic:
Þetta er nóg, þetta er nóg
Uppi í himni eins og vindablær
Þetta er nóg, komið nóg
Og tár mín enginn sér fær

Russian:
Отпусти и забудь,
Этот мир из твоих грёз.
Отпусти и забудь,
И не будет больше слёз.


Output:
Let it go, let it go!
Þetta er nóg, þetta er nóg
Отпусти и забудь,

I am one with the wind and sky
Uppi í himni eins og vindablær
Этот мир из твоих грёз.

Let it go, let it go!
Þetta er nóg, komið nóg
Отпусти и забудь,

You'll never see me cry!
Og tár mín enginn sér fær
И не будет больше слёз.


In fact, it's trivially easy to write something like this, because all
this text is Unicode. ALL of these languages (and plenty more) are
well-covered by Unicode. There's still the ongoing debate of Han
unification, plus the progressive work of adding characters for
ancient scripts and such, but AFAIK, all writing systems currently in
use are covered.
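
Roughly what that looks like, as a Python 3 sketch using the first two lines of each version above (the variable names are mine):

```python
english = ["Let it go, let it go!", "I am one with the wind and sky"]
icelandic = ["Þetta er nóg, þetta er nóg", "Uppi í himni eins og vindablær"]
russian = ["Отпусти и забудь,", "Этот мир из твоих грёз."]

# zip() pairs up the n-th line of each version; no per-language handling needed.
stanzas = ["\n".join(group) for group in zip(english, icelandic, russian)]
output = "\n\n".join(stanzas)

print(output)
```

The code never needs to know which script a line is in, because every line is just a string of code points.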

ChrisA

Re: python 2.7 and unicode (one more time)

2014-11-20 Thread random832

On Thu, Nov 20, 2014, at 20:10, Chris Angelico wrote:
 2) Languages which use a different alphabet (eg Cyrillic - Russian,
 Bulgarian). You could possibly cram them into an eight-bit encoding
 without tipping ASCII out, but I'm not sure. In Unicode, these
 languages are all easily supported by the BMP, as they don't use a
 huge number of characters each.

There are numerous eight-bit encodings that support Latin and one other
alphabet. Remember, ASCII is a seven-bit encoding, and an eight-bit
encoding is basically two seven-bit encodings.

The most difficult (of those still possible at all) language to encode
in eight bits is actually Vietnamese, which uses the Latin alphabet, due
to the sheer number of accented letters used. Windows' encoding of it
(along with some other lesser used encodings, all for Vietnamese) is the
only 8-bit encoding to use combining accents, in a way unfortunately
incompatible with Unicode normalization if naively translated, whereas
VISCII sacrifices a handful of C0 control characters in addition to
fully packing the high half with letters.


-- 
Random832

Re: python 2.7 and unicode (one more time)

2014-11-20 Thread Chris Angelico

On Fri, Nov 21, 2014 at 12:31 PM,  random...@fastmail.us wrote:
 On Thu, Nov 20, 2014, at 20:10, Chris Angelico wrote:
 2) Languages which use a different alphabet (eg Cyrillic - Russian,
 Bulgarian). You could possibly cram them into an eight-bit encoding
 without tipping ASCII out, but I'm not sure. In Unicode, these
 languages are all easily supported by the BMP, as they don't use a
 huge number of characters each.

 There are numerous eight-bit encodings that support Latin and one other
 alphabet. Remember, ASCII is a seven-bit encoding, and an eight-bit
 encoding is basically two seven-bit encodings.

I'm aware of this; Greek, for instance, fits quite happily into
ISO-8859-7, which is eight-bit.
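
To make the "two seven-bit halves" idea concrete, here is a small Python 3 sketch; the sample string is an assumption chosen for illustration:

```python
greek = "αβγ"
mixed = "abc " + greek

encoded = mixed.encode("iso8859_7")

# Iterating over bytes in Python 3 yields integers.
low_half = [b for b in encoded if b < 0x80]     # the ASCII-compatible half
high_half = [b for b in encoded if b >= 0x80]   # the half that holds the Greek letters

print(len(mixed), len(encoded))
print(bytes(low_half))
print([hex(b) for b in high_half])
```

Every character still takes exactly one byte; the Latin letters land in the low half and the Greek letters in the high half.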

 The most difficult (of those still possible at all) language to encode
 in eight bits is actually Vietnamese, which uses the Latin alphabet, due
 to the sheer number of accented letters used. Windows' encoding of it
 (along with some other lesser used encodings, all for Vietnamese) is the
 only 8-bit encoding to use combining accents, in a way unfortunately
 incompatible with Unicode normalization if naively translated, whereas
 VISCII sacrifices a handful of C0 control characters in addition to
 fully packing the high half with letters.

This is what I was suspicious of. The very notion of combining
accents already breaks the idea that "a byte is a character is a
glyph", which most eight-bit encodings try to pretend. In any case,
the BMP still easily copes with them all.
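
A small Python 3 sketch of the Vietnamese case. It assumes that in Python's `cp1258` codec the byte 0xEA is 'ê' and 0xEC is the combining acute accent, which is my understanding of Windows-1258; the two-byte sample is made up for illustration:

```python
import unicodedata

raw = b"\xea\xec"                  # assumed: 'ê' then COMBINING ACUTE ACCENT
decoded = raw.decode("cp1258")     # naive translation, byte by byte

nfc = unicodedata.normalize("NFC", decoded)   # fully composed form
nfd = unicodedata.normalize("NFD", decoded)   # fully decomposed form

print(len(raw), len(decoded), len(nfc), len(nfd))
print(decoded == nfc, decoded == nfd)
```

Two bytes make one visible letter, and the naively decoded string matches neither normalization form: it is partly composed, so it needs normalizing before it can be compared with text from other sources.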

(Hmm. I wonder how you'd typeset the old Self-Pronouncing Alphabet
for English? It's basically English text with a few markings added to
letters - not standard diacriticals that already exist in Unicode, but
dots. Probably possible, one way or another... but I haven't seen SPA
text since the 90s, and that was in stuff published back in the 80s or
so.)

ChrisA

Re: python 2.7 and unicode (one more time)

2014-11-20 Thread Marko Rauhamaa

Chris Angelico ros...@gmail.com:

 On Fri, Nov 21, 2014 at 5:56 AM, Marko Rauhamaa ma...@pacujo.net wrote:
 I don't really like it how Unicode is equated with text, or even
 character strings.
 [...]
 Do you have actual text that you're unable to represent in Unicode?

Not my point at all.

I'm saying equating an abstract data type (string) with its
representation (Unicode vector) is bad taste.

 We don't call numbers IEEE,

Exactly.

 Do you genuinely have text that you can't represent in Unicode, or are
 you just arguing against Unicode to try to justify "Python strings are
 something else" as a basis for your code?

Nobody is arguing against Unicode. I'm saying, let's talk about the
forest instead of the trees (except when the trees really are the
focus).


Marko

Re: python 2.7 and unicode (one more time)

2014-11-20 Thread Chris Angelico

On Fri, Nov 21, 2014 at 5:36 PM, Marko Rauhamaa ma...@pacujo.net wrote:
 Chris Angelico ros...@gmail.com:

 On Fri, Nov 21, 2014 at 5:56 AM, Marko Rauhamaa ma...@pacujo.net wrote:
 I don't really like it how Unicode is equated with text, or even
 character strings.
 [...]
 Do you have actual text that you're unable to represent in Unicode?

 Not my point at all.

 I'm saying equating an abstract data type (string) with its
 representation (Unicode vector) is bad taste.

What about "sequence of Unicode code points" is "representation"? What
is your abstraction over that?

ChrisA

Re: python 2.7 and unicode (one more time)

2014-11-20 Thread Marko Rauhamaa

Chris Angelico ros...@gmail.com:

 On Fri, Nov 21, 2014 at 5:36 PM, Marko Rauhamaa ma...@pacujo.net wrote:
 I'm saying equating an abstract data type (string) with its
 representation (Unicode vector) is bad taste.

 What about "sequence of Unicode code points" is "representation"? What
 is your abstraction over that?

The letter 'A' is a character. Unicode for the letter 'A' is 65. You
very rarely care about that number. You are only interested in
the letter 'A', which you can use to spell people's names, for instance.

When you read a book, you read the text, not the ink.


Marko

Re: python 2.7 and unicode (one more time)

2014-11-20 Thread Chris Angelico

On Fri, Nov 21, 2014 at 6:14 PM, Marko Rauhamaa ma...@pacujo.net wrote:
 Chris Angelico ros...@gmail.com:

 On Fri, Nov 21, 2014 at 5:36 PM, Marko Rauhamaa ma...@pacujo.net wrote:
 I'm saying equating an abstract data type (string) with its
 representation (Unicode vector) is bad taste.

 What about "sequence of Unicode code points" is "representation"? What
 is your abstraction over that?

 The letter 'A' is a character. Unicode for the letter 'A' is 65. You
 very rarely care about that number. You are only interested in
 the letter 'A', which you can use to spell people's names, for instance.

 When you read a book, you read the text, not the ink.

Then you need to read more about Unicode. The *codepoint* for the
letter 'A' is 65. That is not Unicode, that is one part of the Unicode
spec.
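
To illustrate the distinction (Python 3 syntax; the choice of codecs is just for the example):

```python
import unicodedata

ch = "A"
code_point = ord(ch)             # the abstract number assigned to the character
name = unicodedata.name(ch)      # the character's name in the Unicode database

# The same code point has different byte representations in each encoding.
encodings = {
    codec: ch.encode(codec)
    for codec in ("utf-8", "utf-16-le", "utf-32-le")
}

print(code_point, name)
for codec, data in encodings.items():
    print(codec, data)
```

The number 65 identifies the character; the bytes only appear once you pick an encoding, and they differ from one encoding to the next.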

ChrisA
