Re: Trouble saving unicode text to file

2005-05-11 Thread Thomas Bellman
"Martin v. Löwis" [EMAIL PROTECTED] wrote:

Thomas Bellman wrote:
 Fixed-width characters *do* have advantages, even in the external
 representation.  With fixed-width characters you don't have to
 parse the entire file or stream in order to read the Nth character;
 instead you can skip or seek to an octet position that can be
 calculated directly from N.

 OTOH, encodings that are free of null bytes and ASCII compatible
 also have advantages.

Indeed, indeed.  But that's no reason to choose UTF-16 over UTF-32,
since you don't get those advantages then.

 And not the least, UTF-32 is *beautiful* compared to UTF-16.

 But ugly compared to UTF-8. Not only does it have the null byte
 and the ASCII incompatibility problem, but it also has the
 endianness problem. So for exchanging Unicode between systems,
 I can see no reason to use anything but UTF-8 (unless, of course,
 one end, or the protocol, already dictates a different encoding).

UTF-8 beats UTF-32 in the practicality department, due to its
compatibility with legacy software, but in my opinion UTF-32 wins
over UTF-8 for sheer beauty, even with the endianness problem.

I do wish they had standardized on one single endianness for UTF-32
(and UTF-16), instead of allowing both to exist.  In the mid-1990s
I had to work with files in the TIFF format, which allows both
endiannesses.  The specification *requires* you to read both, but it
was a rare sight to find MS Windows software that didn't barf on
big endian TIFF files. :-(  Unix software tended to be better at
reading both endians, but generally wrote in the native format,
meaning big endian on Sun Sparc.  Luckily I could convert files
using tiffcp on our Unix machines, but it was irritating to have to
introduce that extra step.  I fully expect the same problem to
happen with UTF-16 and UTF-32 too.

Anyway, back to UTF, my complaint is that UTF-16 doesn't give you
the advantages of *either* UTF-8 or UTF-32, so if you have the
choice, UTF-16 is always the worst alternative of those three.  I
see no reason to recommend UTF-16 at all.


-- 
Thomas Bellman,   Lysator Computer Club,   Linköping University,  Sweden
God is real, but Jesus is an integer.  !  bellman @ lysator.liu.se
 !  Make Love -- Nicht Wahr!
-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Trouble saving unicode text to file

2005-05-10 Thread Thomas Bellman
John Machin [EMAIL PROTECTED] writes:

 Which raises a question: who or what is going to read your file? If a
 Unicode-aware application, and never a human, you might like to
 consider encoding the text as utf-16.

Why would one want to use an encoding that is neither semi-compatible
with ASCII (the way UTF-8 is), nor uses fixed-width characters (like
UTF-32 does)?


-- 
Thomas Bellman,   Lysator Computer Club,   Linköping University,  Sweden
You are in a twisty little passage of   !  bellman @ lysator.liu.se
 standards, all conflicting.!  Make Love -- Nicht Wahr!


Re: Trouble saving unicode text to file

2005-05-10 Thread John Machin
On Tue, 10 May 2005 07:59:31 + (UTC), Thomas Bellman
[EMAIL PROTECTED] wrote:

John Machin [EMAIL PROTECTED] writes:

 Which raises a question: who or what is going to read your file? If a
 Unicode-aware application, and never a human, you might like to
 consider encoding the text as utf-16.

Why would one want to use an encoding that is neither semi-compatible
with ASCII (the way UTF-8 is), nor uses fixed-width characters (like
UTF-32 does)?

UTF-32 is yet another encoding. You still need to decode it into the
internal form supported by your processing software. With UTF-32xE,
you can only skip the decoding step when file's x == software's x and
your software uses 32 bits internally.

Python (2.4.1) doesn't have a utf_32 codec. Perhaps that's because
there isn't much call for it (yet). Let's pretend there is such a
codec in Python.

Once you have done codecs.open('inputfile', 'rb', 'utf_32') or
receivedstring.decode('utf_32'), what do you care whether your
*external representation* has fixed-width characters or not?

Putting it another way, any advantage of fixed-width characters is to
be found in *internal* storage, not *external* transmission or
storage. 

At the other end, if you don't have to squeeze your data through an
8-bit-wide non-binary channel, and you have no need for legibility to
humans, then the remaining considerations are efficiency and (if you
have no control over what's used at the other end) whether the
necessary codec is widely implemented. 

So rather than utf-16, perhaps I should have written something like:

Consider utf-8 or utf-16. Consider following this by compression using
a widely-implemented protocol (gzip/zip/bzip2).
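
To put rough numbers on that suggestion, here is a minimal sketch in present-day Python 3 (not the Python 2.4 discussed in this thread). The sample text is an assumption chosen only for illustration, so the sizes it prints say nothing about other kinds of data.

```python
import gzip

# Hypothetical sample text; any Unicode string would do.
text = "R\u00e4ksm\u00f6rg\u00e5s och kaffe, tack. " * 200

sizes = {}
for encoding in ("utf-8", "utf-16", "utf-32"):
    raw = text.encode(encoding)
    packed = gzip.compress(raw)
    # The round trip must give back exactly the original text.
    assert gzip.decompress(packed).decode(encoding) == text
    sizes[encoding] = (len(raw), len(packed))

for encoding, (raw_len, packed_len) in sizes.items():
    print(f"{encoding}: {raw_len} bytes raw, {packed_len} bytes gzipped")
```

For mostly Latin text like this, the raw sizes differ by roughly a factor of two per step; how much of that gap survives compression depends on the data.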


Cheers,
John


Re: Trouble saving unicode text to file

2005-05-10 Thread Thomas Bellman
John Machin [EMAIL PROTECTED] wrote:

 UTF-32 is yet another encoding.
[...]
 Once you have done codecs.open('inputfile', 'rb', 'utf_32') or
 receivedstring.decode('utf_32'), what do you care whether your
 *external representation* has fixed-width characters or not?

 Putting it another way, any advantage of fixed-width characters is to
 be found in *internal* storage, not *external* transmission or
 storage. 

 At the other end, if you don't have to squeeze your data through an
 8-bit-wide non-binary channel, and you have no need for legibility to
 humans, then the remaining considerations are efficiency and (if you
 have no control over what's used at the other end) whether the
 necessary codec is widely implemented. 

So, are you saying that any encoding that handles all the needed
characters are equally good choices?  So why not choose UTF-7?
Or Punycode?

Should you never care what the black box you are using looks like
on the inside?  Hadn't it mattered if X.400 won over SMTP?  Both
protocols are somewhat capable of sending emails after all; X.400
is just a bit more complicated on the inside where normal users
don't see.


Fixed-width characters *do* have advantages, even in the external
representation.  With fixed-width characters you don't have to
parse the entire file or stream in order to read the Nth character;
instead you can skip or seek to an octet position that can be
calculated directly from N.

In-place editing of single characters in large files becomes more
efficient.
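
A minimal sketch of that seek-by-arithmetic idea, written in present-day Python 3 rather than the Python 2.4 of this thread. The helper names read_nth_char and overwrite_nth_char are hypothetical, an in-memory stream stands in for a large file, and "character" here means a single code point.

```python
import io

text = "\u00e5\u00e4\u00f6 and some more text"   # starts with "åäö"
# Use an explicit byte order so that no byte order mark is written
# and every character occupies exactly four octets.
data = io.BytesIO(text.encode("utf-32-le"))

def read_nth_char(stream, n):
    """Return character number n (zero-based) from a UTF-32-LE stream."""
    stream.seek(4 * n)          # octet position computed directly from n
    return stream.read(4).decode("utf-32-le")

def overwrite_nth_char(stream, n, char):
    """Replace character number n in place; the stream size is unchanged."""
    stream.seek(4 * n)
    stream.write(char.encode("utf-32-le"))

print(read_nth_char(data, 2))            # the third character
overwrite_nth_char(data, 0, "\u00c5")    # replace the first character with "Å"
print(data.getvalue().decode("utf-32-le"))
```

The same two calls on a UTF-8 file would first require scanning from the start to find where character n begins, and an in-place replacement could change the length of the file.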

The codec for UTF-32 is extremely simple.  There are no illegal
sequences to care about, like there are in UTF-8 and UTF-16, just
illegal single 32-bit values (those that are larger than 0x10FFFF).
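
To illustrate how little such a codec has to do, here is a hand-written decoder sketched in present-day Python 3. The function name decode_utf32_le is hypothetical, and the surrogate-range check is an addition not mentioned in the post (the Unicode standard excludes 0xD800 to 0xDFFF as well as values above 0x10FFFF). Real code would simply use the built-in codec.

```python
import struct

def decode_utf32_le(data):
    """Decode little-endian UTF-32 without using the built-in codec."""
    if len(data) % 4:
        raise ValueError("length is not a multiple of four octets")
    chars = []
    for (value,) in struct.iter_unpack("<I", data):
        # Each 32-bit value is checked on its own; no state carries over
        # from one character to the next.
        if value > 0x10FFFF or 0xD800 <= value <= 0xDFFF:
            raise ValueError(f"illegal code point 0x{value:X}")
        chars.append(chr(value))
    return "".join(chars)

print(decode_utf32_le("a\u00e5\u20ac".encode("utf-32-le")))
```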

And not the least, UTF-32 is *beautiful* compared to UTF-16.


-- 
Thomas Bellman,   Lysator Computer Club,   Linköping University,  Sweden
Adde parvum parvo magnus acervus erit   ! bellman @ lysator.liu.se
  (From The Mythical Man-Month)   ! Make Love -- Nicht Wahr!

Re: Trouble saving unicode text to file

2005-05-10 Thread Martin v. Löwis
Thomas Bellman wrote:
 Fixed-width characters *do* have advantages, even in the external
 representation.  With fixed-width characters you don't have to
 parse the entire file or stream in order to read the Nth character;
 instead you can skip or seek to an octet position that can be
 calculated directly from N.

OTOH, encodings that are free of null bytes and ASCII compatible
also have advantages.

 And not the least, UTF-32 is *beautiful* compared to UTF-16.

But ugly compared to UTF-8. Not only does it have the null byte
and the ASCII incompatibility problem, but it also has the
endianness problem. So for exchanging Unicode between systems,
I can see no reason to use anything but UTF-8 (unless, of course,
one end, or the protocol, already dictates a different encoding).
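
The three problems named above are easy to see by encoding a single ASCII letter. This is a small sketch in present-day Python 3 (the thread itself predates it); the explicit -le and -be codec names are used so that no byte order mark is added.

```python
sample = "A"
encoded = {
    name: sample.encode(name)
    for name in ("utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be")
}
for name, octets in encoded.items():
    # Print each encoding's octets as space-separated hex.
    print(f"{name:10} {octets.hex(' ')}")
```

Only the UTF-8 result is the plain ASCII octet 41; the others contain null octets, and their octet order depends on the endianness chosen.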

Regards,
Martin


Re: Trouble saving unicode text to file

2005-05-09 Thread F. Petitjean
Le Mon, 09 May 2005 08:39:40 +1000, John Machin a écrit :
 On Sun, 08 May 2005 19:49:42 +0200, Martin v. Löwis
[EMAIL PROTECTED] wrote:
 
John Machin wrote:
 Martin, I can't guess the reason for this last suggestion; why should
 a Windows system use iso-8859-1 instead of cp1252?

Windows users often think that windows-1252 is the same thing as
iso-8859-1, and then exchange data in windows-1252, but declare them
as iso-8859-1 (in particular, this is common for HTML files).
iso-8859-1 is more portable than windows-1252, so it should be
preferred when the data need to be exchanged across systems.
 
 1. When exchanging data across systems, should not utf-8 be
 preferred???
 
 2. If the Windows *users* have been using characters that are in
 cp1252 but not in iso-8859-1, then attempting to convert to iso-8859-1
 will cause an exception. 
 
 >>> euro_win = chr(128)
 >>> euro_uc = euro_win.decode('cp1252')
 >>> euro_uc
 u'\u20ac'
 >>> unicodedata.name(euro_uc)
 'EURO SIGN'
 >>> euro_iso = euro_uc.encode('iso-8859-1')
 Traceback (most recent call last):
   File "<stdin>", line 1, in ?
 UnicodeEncodeError: 'latin-1' codec can't encode character u'\u20ac'
 in position 0: ordinal not in range(256)

 
 I find it a bit hard to imagine that the euro sign wouldn't get a fair
 bit of usage in Swedish data processing even if it's not their own
 currency.
For Western European countries, another codec exists which includes the
'EURO SIGN'. It is spelled 'iso8859_15' (with an alias 'iso-8859-15'
according to the 4.9.2 Standard Encodings page of the Python library
reference).
>>> euro_iso = euro_uc.encode('iso8859_15')
>>> euro_iso
'\xa4'
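
For readers on present-day Python 3, where str is already Unicode, the same comparison can be sketched as follows. The loop and the choice of four encodings are only illustrative.

```python
euro = "\u20ac"  # EURO SIGN

for encoding in ("cp1252", "iso8859_15", "utf-8", "iso-8859-1"):
    try:
        print(encoding, euro.encode(encoding))
    except UnicodeEncodeError as exc:
        # iso-8859-1 has no euro sign, so encoding fails here.
        print(encoding, "cannot encode it:", exc.reason)
```

Each encoding that does contain the euro sign puts it at a different octet value, which is why the declared encoding has to match the one actually used.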
 
 3. How portable is a character set that doesn't include the euro sign?
I think it is due to historical constraints: ISO Latin-1 existed before
the EURO SIGN appeared.
 
 Regards,
 John


Re: Trouble saving unicode text to file

2005-05-09 Thread Fredrik Lundh
John Machin wrote:

 I find it a bit hard to imagine that the euro sign wouldn't get a fair
 bit of usage in Swedish data processing even if it's not their own
 currency.

it's spelled Euro or EUR in swedish.

(if you live in a country that use letters to represent its own currency,
you tend to prefer letters for foreign currencies as well)

(I just noticed that there's no euro sign on my swedish keyboard.  I've
never missed it ;-)

/F 





Re: Trouble saving unicode text to file

2005-05-09 Thread Max M
Fredrik Lundh wrote:

 (I just noticed that there's no euro sign on my swedish keyboard.  I've
 never missed it ;-)

It's probably AltGR + E like here in DK

-- 

hilsen/regards Max M, Denmark

http://www.mxm.dk/
IT's Mad Science


Re: Trouble saving unicode text to file

2005-05-09 Thread Simon Brunning
On 5/9/05, Max M [EMAIL PROTECTED] wrote:
 Fredrik Lundh wrote:
 
  (I just noticed that there's no euro sign on my swedish keyboard.  I've
  never missed it ;-)
 
 It's probably AltGR + E like here in DK

My UK keyboard has it as AltGr + 4, FWIW.

-- 
Cheers,
Simon B,
[EMAIL PROTECTED],
http://www.brunningonline.net/simon/blog/


Re: Trouble saving unicode text to file

2005-05-09 Thread Fredrik Lundh
Max M wrote:

 (I just noticed that there's no euro sign on my swedish keyboard.  I've
 never missed it ;-)

 It's probably AltGR + E like here in DK

ah, there it is.  almost entirely worn out.  and it doesn't work.  but a little
fooling around reveals that AltGr+5 does work.  oh well, you learn something
new every day.

/F 





Re: Trouble saving unicode text to file

2005-05-09 Thread Martin v. Löwis
John Machin wrote:
 Terminology disambiguation: what I call users wouldn't know what
 'cp1252' and 'iso-8859-1' were. They're not expected to know. They
 just type in whatever characters they can see on their keyboard or
 find in the charmap utility. It's what I'd call 'admins' and
 'developers' who should know better, but often don't.

I was talking about 'users' of Python, so they are 'developers'.
They often don't know what cp1252 is.

 1. When exchanging data across systems, should not utf-8 be
 preferred???

It depends on the data, of course. People writing UTF-8 into
text files often find that their editors don't display them
correctly, in which case UTF-8 might not be the best choice.
For example, the Python source code in CVS is required to be
iso-8859-1, primarily because this is what interoperates best
across all development platforms.

For data in XHTML, the answer would be different: every XML
processor is supposed to support UTF-8.

 2. If the Windows *users* have been using characters that are in
 cp1252 but not in iso-8859-1, then attempting to convert to iso-8859-1
 will cause an exception. 

Correct.

 I find it a bit hard to imagine that the euro sign wouldn't get a fair
 bit of usage in Swedish data processing even if it's not their own
 currency.

Yes, so the question is how to represent it. It all depends on the
application, but it is safer to only assume iso-8859-1 for the moment,
unless it is guaranteed that all code that reads the file in really
knows what cp1252 is, and what \x80 means in that charset.

 3. How portable is a character set that doesn't include the euro sign?

Well, how portable is ASCII? It doesn't support certain characters,
sure. If you don't need these characters, this is not a problem. If
you do need the extra characters, you need to think thoroughly what
encoding meets your needs best. I was merely suggesting that cp1252
is often used without that thought, causing mojibake later.

If representation of the euro sign is an issue, the choices are
iso-8859-15, cp1252, and UTF-8. Of those three, I would pick
cp1252 last if at all possible, because it is specific to a
vendor (i.e. non-standard).
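
A minimal sketch, in present-day Python 3, of the kind of mojibake described above: text written as cp1252 but read by code that trusts an iso-8859-1 label. The sample string is an assumption for illustration.

```python
original = "10 \u20ac"                    # "10 €"
written = original.encode("cp1252")       # the euro sign becomes octet 0x80
misread = written.decode("iso-8859-1")    # the reader trusts the wrong label

# No exception is raised; the euro sign silently turns into the
# invisible control character U+0080.
print(repr(original), "->", repr(misread))
```

Because iso-8859-1 maps every octet to some character, the mistake goes unnoticed until someone looks at the output.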

Regards,
Martin


Re: Trouble saving unicode text to file

2005-05-08 Thread Martin v. Löwis
Svennglenn wrote:
 # -*- coding: cp1252 -*-
 
 titel = "åäö"
 titel = unicode(titel)

Instead of this, just write

# -*- coding: cp1252 -*-

titel = u"åäö"

 fil = open("testfil.txt", "w")
 fil.write(titel)
 fil.close()

Instead of this, write

import codecs
fil = codecs.open("testfil.txt", "w", "cp1252")
fil.write(titel)
fil.close()

Instead of cp1252, consider using ISO-8859-1.

Regards,
Martin


Re: Trouble saving unicode text to file

2005-05-08 Thread John Machin
On Sun, 08 May 2005 11:23:49 +0200, Martin v. Löwis
[EMAIL PROTECTED] wrote:

Svennglenn wrote:
 # -*- coding: cp1252 -*-
 
 titel = "åäö"
 titel = unicode(titel)

Instead of this, just write

# -*- coding: cp1252 -*-

titel = u"åäö"

 fil = open("testfil.txt", "w")
 fil.write(titel)
 fil.close()

Instead of this, write

import codecs
fil = codecs.open("testfil.txt", "w", "cp1252")
fil.write(titel)
fil.close()

Instead of cp1252, consider using ISO-8859-1.

Martin, I can't guess the reason for this last suggestion; why should
a Windows system use iso-8859-1 instead of cp1252?

Regards,
John




Re: Trouble saving unicode text to file

2005-05-08 Thread Martin v. Löwis
John Machin wrote:
 Martin, I can't guess the reason for this last suggestion; why should
 a Windows system use iso-8859-1 instead of cp1252?

Windows users often think that windows-1252 is the same thing as
iso-8859-1, and then exchange data in windows-1252, but declare them
as iso-8859-1 (in particular, this is common for HTML files).
iso-8859-1 is more portable than windows-1252, so it should be
preferred when the data need to be exchanged across systems.

Regards,
Martin


Re: Trouble saving unicode text to file

2005-05-08 Thread John Machin
On Sun, 08 May 2005 19:49:42 +0200, Martin v. Löwis
[EMAIL PROTECTED] wrote:

John Machin wrote:
 Martin, I can't guess the reason for this last suggestion; why should
 a Windows system use iso-8859-1 instead of cp1252?

Windows users often think that windows-1252 is the same thing as
iso-8859-1, and then exchange data in windows-1252, but declare them
as iso-8859-1 (in particular, this is common for HTML files).
iso-8859-1 is more portable than windows-1252, so it should be
preferred when the data need to be exchanged across systems.

Martin, it seems I'm still a long way short of enlightenment; please
bear with me:

Terminology disambiguation: what I call users wouldn't know what
'cp1252' and 'iso-8859-1' were. They're not expected to know. They
just type in whatever characters they can see on their keyboard or
find in the charmap utility. It's what I'd call 'admins' and
'developers' who should know better, but often don't.

1. When exchanging data across systems, should not utf-8 be
preferred???

2. If the Windows *users* have been using characters that are in
cp1252 but not in iso-8859-1, then attempting to convert to iso-8859-1
will cause an exception. 

>>> euro_win = chr(128)
>>> euro_uc = euro_win.decode('cp1252')
>>> euro_uc
u'\u20ac'
>>> unicodedata.name(euro_uc)
'EURO SIGN'
>>> euro_iso = euro_uc.encode('iso-8859-1')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u20ac'
in position 0: ordinal not in range(256)


I find it a bit hard to imagine that the euro sign wouldn't get a fair
bit of usage in Swedish data processing even if it's not their own
currency.

3. How portable is a character set that doesn't include the euro sign?

Regards,
John


Trouble saving unicode text to file

2005-05-07 Thread Svennglenn
I'm working on a program that is supposed to save
different information to text files.

Because the program is in Swedish I have to use
unicode text for ÅÄÖ letters.

When I run the following testscript I get an error message.

# -*- coding: cp1252 -*-

titel = "åäö"
titel = unicode(titel)

print "Titel type", type(titel)

fil = open("testfil.txt", "w")
fil.write(titel)
fil.close()


Traceback (most recent call last):
  File "D:\Documents and Settings\Daniel\Desktop\Programmering\aaotest\aaotest2\aaotest2.pyw", line 5, in ?
    titel = unicode(titel)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 0: ordinal not in range(128)


I need to have the titel variable in unicode format because when I
write åäö in an entry box in Tkinter it automatically converts the
value to unicode format.

Does anyone know an easy way to save this unicode format text
to a file?



Re: Trouble saving unicode text to file

2005-05-07 Thread Skip Montanaro

    Svennglenn> Traceback (most recent call last):
    Svennglenn>   File "D:\Documents and Settings\Daniel\Desktop\Programmering\aaotest\aaotest2\aaotest2.pyw", line 5, in ?
    Svennglenn>     titel = unicode(titel)
    Svennglenn> UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 0: ordinal not in range(128)

Try:

import codecs

titel = "åäö"
titel = unicode(titel, "iso-8859-1")
fil = codecs.open("testfil.txt", "w", "iso-8859-1")
fil.write(titel)
fil.close()

Skip


Re: Trouble saving unicode text to file

2005-05-07 Thread John Machin
On 7 May 2005 14:22:56 -0700, Svennglenn [EMAIL PROTECTED]
wrote:

I'm working on a program that is supposed to save
different information to text files.

Because the program is in swedish i have to use
unicode text for ÅÄÖ letters.

"program is in Swedish": to the extent that this means "names of
variables are in Swedish", this is quite irrelevant. The variable
names could be in some other language, like Slovak, Slovenian, Swahili
or Strine. Your problem(s) (PLURAL) arise from the fact that your text
data is in Swedish, the representation of which uses a few non-ASCII
characters. Problem 1 is the representation of Swedish in text
constants in your program; this is causing the exception you show
below but curiously didn't ask for help with.


When I run the following testscript I get an error message.

# -*- coding: cp1252 -*-

titel = "åäö"
titel = unicode(titel)

You should use titel = u"åäö"
Works, and saves wear & tear on your typing fingers.


print "Titel type", type(titel)

fil = open("testfil.txt", "w")
fil.write(titel)
fil.close()


Traceback (most recent call last):
  File "D:\Documents and Settings\Daniel\Desktop\Programmering\aaotest\aaotest2\aaotest2.pyw", line 5, in ?
    titel = unicode(titel)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 0: ordinal not in range(128)


I need to have the titel variable in unicode format because when I
write åäö in an entry box in Tkinter it automatically converts the
value to unicode format.

The general rule in working with Unicode can be expressed something
like "work in Unicode all the time", i.e. decode legacy text as early as
possible; encode into legacy text (if absolutely required) as late as
possible (corollary: if forced to communicate with another
Unicode-aware system over an 8-bit wide channel, encode as utf-8, not
cp666).

Applying this to Problem 1 is, as you've seen, trivial: To the extent
that you have text constants at all in your program, they should be in
Unicode.
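
The "decode early" half of that rule, sketched in present-day Python 3, where the distinction between octets (bytes) and text (str) is explicit. The octets are those of "åäö" in cp1252; naming the variable legacy is only for illustration.

```python
legacy = b"\xe5\xe4\xf6"          # the octets for "åäö" in cp1252

try:
    legacy.decode("ascii")        # the default the original script fell back on
except UnicodeDecodeError as exc:
    print("ascii fails:", exc)

# Decode as early as possible, naming the encoding the data really uses.
titel = legacy.decode("cp1252")
print(titel, type(titel))
```

The failure in the first step is the same one the original traceback reports: 0xe5 is outside the ASCII range.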

Now after all that, Problem 2: how to save Unicode text to a file?

Which raises a question: who or what is going to read your file? If a
Unicode-aware application, and never a human, you might like to
consider encoding the text as utf-16. If Unicode-aware app plus
(occasional human developer or not CJK and you want to save space),
try utf-8. For general use on Windows boxes in the Latin1 subset of
the universe, you'll no doubt want to encode as cp1252. 


Are there anyone who knows an easy way to save this unicode format text
to a file?

Read the docs of the codecs module -- skipping over how to register
codecs, just concentrate on using them.

Try this:

# -*- coding: cp1252 -*-
import codecs
titel = u"åäö"
print "Titel type", type(titel)
f1 = codecs.open('titel.u16', 'wb', 'utf_16')
f2 = codecs.open('titel.u8', 'w', 'utf_8')
f3 = codecs.open('titel.txt', 'w', 'cp1252')
# much later, maybe in a different function
# maybe even in a different module
f1.write(titel)
f2.write(titel)
f3.write(titel)
# much later
f1.close()
f2.close()
f3.close()

Note: doing it this way follows the "encode as late as possible" rule
and documents the encoding for the whole file, in one place. Other
approaches which might use the .encode() method of Unicode strings and
then write the 8-bit-string results at different times and in
different functions/modules are somewhat less clean and more prone to
mistakes.
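
For comparison, a sketch of the same idea in present-day Python 3, where the built-in open() accepts an encoding argument and codecs.open is no longer needed for this. The temporary directory and file name are assumptions made so that the example cleans up after nothing important.

```python
import os
import tempfile

titel = "\u00e5\u00e4\u00f6"  # "åäö"

# Hypothetical file location inside a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "titel.txt")

# The encoding is declared once, where the file is opened; every
# later write() hands over text and lets the file object encode it.
with open(path, "w", encoding="cp1252") as fil:
    fil.write(titel)

# Read the file back as raw octets to see what was actually stored.
with open(path, "rb") as fil:
    raw = fil.read()
print(raw)
```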

HTH,
John


Re: Trouble saving unicode text to file

2005-05-07 Thread Ivan Van Laningham
Hi All--

John Machin wrote:
 
 
 The general rule in working with Unicode can be expressed something
 like "work in Unicode all the time", i.e. decode legacy text as early as
 possible; encode into legacy text (if absolutely required) as late as
 possible (corollary: if forced to communicate with another
 Unicode-aware system over an 8-bit wide channel, encode as utf-8, not
 cp666)
 

+1 QOTW

And true, too.

i-especially-like-the-cp666-part-ly y'rs,
Ivan
--
Ivan Van Laningham
God N Locomotive Works
http://www.andi-holmes.com/
http://www.foretec.com/python/workshops/1998-11/proceedings.html
Army Signal Corps:  Cu Chi, Class of '70
Author:  Teach Yourself Python in 24 Hours