Re: (Simple?) Unicode Question

2009-08-30 Thread Nobody
On Sun, 30 Aug 2009 02:36:49 +, Steven D'Aprano wrote:

 So long as your terminal has a sensible encoding, and you have a good
 quality font, you should be able to print any string you can create.
 
 UTF-8 isn't a particularly sensible encoding for terminals.
 
 Did I mention UTF-8?
 
 Out of curiosity, why do you say that UTF-8 isn't sensible for terminals?

I don't think I've ever seen a terminal (whether an emulator running on a
PC or a hardware terminal) which supports anything like the entire Unicode
repertoire, along with right-to-left writing, complex scripts, etc. Even
support for double-width characters is uncommon.

If your terminal can't handle anything outside of ISO-8859-1, there isn't
any advantage to using UTF-8, and some disadvantages; e.g. a typical Unix
tty driver will delete the last *byte* from the input buffer when you
press backspace (Linux 2.6.* has the IUTF8 flag, but this is non-standard).
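
To make the backspace problem concrete, here is a minimal sketch in Python 3. It only simulates a byte-oriented erase by slicing off the last byte; it does not talk to a real tty driver, and the variable names are purely illustrative.

```python
# Sketch: what a byte-oriented "erase one byte" does to UTF-8 input.
line = "caf\u00e9".encode("utf-8")       # b'caf\xc3\xa9'; the accented e takes two bytes

# A driver that is unaware of UTF-8 removes exactly one byte per backspace.
after_backspace = line[:-1]              # b'caf\xc3', a dangling lead byte

try:
    after_backspace.decode("utf-8")
    still_valid = True
except UnicodeDecodeError:
    still_valid = False

# With a unibyte encoding the same erase removes one whole character.
latin1_line = "caf\u00e9".encode("iso-8859-1")        # b'caf\xe9'
latin1_after = latin1_line[:-1].decode("iso-8859-1")  # 'caf'

print(after_backspace, still_valid, latin1_after)
```

In the UTF-8 case the buffer is left holding half a character, which is the situation the IUTF8 flag is meant to avoid.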

Historically, terminal I/O has tended to revolve around unibyte encodings,
with everything except the endpoints being encoding-agnostic. Anything
which falls outside of that is a dog's breakfast; it's no coincidence
that the word for messed-up text (arising from an encoding mismatch)
was borrowed from Japanese (mojibake).

Life is simpler if you can use a unibyte encoding. Apart from anything
else, the failure modes tend to be harmless. E.g. you get the wrong glyph
rather than two glyphs where you expected one. On a 7-bit channel, you get
the wrong printable character rather than a control character (this is why
ISO-8859-* reserves \x80-\x9F as control codes rather than using them as
printable characters).
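
A small Python 3 sketch of those failure modes, under the assumption that stripping the high bit is a fair stand-in for a 7-bit channel. cp1252 is used only as an example of an encoding that does put printable characters in \x80-\x9F.

```python
text = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE

# Unibyte mismatch: one byte in, one (wrong) character out.
wrong_glyph = text.encode("iso-8859-1").decode("iso-8859-5")

# Multibyte mismatch: UTF-8 bytes read as ISO-8859-1 give two characters.
mojibake = text.encode("utf-8").decode("iso-8859-1")

def strip_high_bit(data):
    """Simulate a 7-bit channel by clearing the top bit of every byte."""
    return bytes(b & 0x7F for b in data)

# ISO-8859-1: 0xE9 becomes 0x69, the printable letter 'i'.
seven_bit_latin1 = strip_high_bit(text.encode("iso-8859-1"))

# cp1252 puts the euro sign at 0x80; stripped, it becomes the control code 0x00.
seven_bit_cp1252 = strip_high_bit("\u20ac".encode("cp1252"))

print(len(wrong_glyph), len(mojibake), seven_bit_latin1, seven_bit_cp1252)
```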

 And Unicode font is an oxymoron. You can merge a whole bunch of fonts
 together and stuff them into a TTF file; that doesn't make them a
 font, though.
 
 I never mentioned Unicode font either. In any case, there's no reason 
 why a skillful designer can't make a single font which covers the entire 
 Unicode range in a consistent style.

Consistency between unrelated scripts is neither realistic nor
desirable.

E.g. Latin fonts tend to use uniform stroke widths unless they're
specifically designed to look like handwriting, whereas Han fonts tend to
prefer variable-width strokes which reflect the direction.

 The main advantage of using Unicode internally is that you can associate
 encodings with the specific points where data needs to be converted
 to/from bytes, rather than having to carry the encoding details around
 the program.
 
 Surely the main advantage of Unicode is that it gives you a full and 
 consistent range of characters not limited to the 128 characters provided 
 by ASCII?

Nothing stops you from using other encodings, or from using multiple
encodings. But using multiple encodings means keeping track of the
encodings. This isn't impossible, and it may produce better results (e.g.
no information loss from Han unification), but it can be a lot more work.

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: (Simple?) Unicode Question

2009-08-29 Thread Thorsten Kampe
* Rami Chowdhury (Thu, 27 Aug 2009 09:44:41 -0700)
  Further, does anything, except a printing device need to know the
  encoding of a piece of text?

Python needs to know if you are processing the text.
 
 I may be wrong, but I believe that's part of the idea behind the separation
 of string and bytes types in Python 3.x. I believe, if you are using  
 Python 3.x, you don't need the character encoding mumbo jumbo at all ;-)

Nothing has changed in that regard. You still need to decode and encode 
text and for that you have to know the encoding.
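
A quick Python 3 sketch of why the encoding has to be known at that point: the same byte decodes to different text, or fails to decode at all, depending on which encoding is assumed. The byte value is an arbitrary example.

```python
raw = b"\xe4"  # a single byte, as it might arrive from a file or a socket

as_latin1 = raw.decode("iso-8859-1")   # a Latin letter (a with diaeresis)
as_koi8r = raw.decode("koi8-r")        # a Cyrillic letter instead

try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    # 0xE4 on its own is an incomplete multi-byte sequence in UTF-8.
    utf8_ok = False

print(as_latin1 == as_koi8r, utf8_ok)
```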

Thorsten


Re: (Simple?) Unicode Question

2009-08-29 Thread Steven D'Aprano
On Sat, 29 Aug 2009 09:34:43 +0200, Thorsten Kampe wrote:

 * Rami Chowdhury (Thu, 27 Aug 2009 09:44:41 -0700)
  Further, does anything, except a printing device need to know the
  encoding of a piece of text?
 
 Python needs to know if you are processing the text.

Python only needs to know when you convert the text to or from bytes. I 
can do this:

>>> s = "hello"
>>> t = "world"
>>> print(' '.join([s, t]))
hello world

and not need to care anything about encodings.

So long as your terminal has a sensible encoding, and you have a good 
quality font, you should be able to print any string you can create.



 I may be wrong, but I believe that's part of the idea behind the
 separation of string and bytes types in Python 3.x. I believe, if you
 are using Python 3.x, you don't need the character encoding mumbo jumbo
 at all ;-)
 
 Nothing has changed in that regard. You still need to decode and encode
 text and for that you have to know the encoding.

You only need to worry about encoding when you convert from bytes to 
text, and vice versa. Admittedly, the most common time you need to do 
that is when reading input from files, but if all your text strings are 
generated by Python, and not output anywhere, you shouldn't need to care 
about encodings.

If all your text contains nothing but ASCII characters, you should never 
need to worry about encodings at all.


-- 
Steven


Re: (Simple?) Unicode Question

2009-08-29 Thread Nobody
On Sat, 29 Aug 2009 08:26:54 +, Steven D'Aprano wrote:

 Python only needs to know when you convert the text to or from bytes. I 
 can do this:
 
 >>> s = "hello"
 >>> t = "world"
 >>> print(' '.join([s, t]))
 hello world
 
 and not need to care anything about encodings.
 
 So long as your terminal has a sensible encoding, and you have a good 
 quality font, you should be able to print any string you can create.

UTF-8 isn't a particularly sensible encoding for terminals.

And Unicode font is an oxymoron. You can merge a whole bunch of fonts
together and stuff them into a TTF file; that doesn't make them a font,
though.

 I may be wrong, but I believe that's part of the idea behind the
 separation of string and bytes types in Python 3.x. I believe, if you
 are using Python 3.x, you don't need the character encoding mumbo jumbo
 at all ;-)
 
 Nothing has changed in that regard. You still need to decode and encode
 text and for that you have to know the encoding.
 
 You only need to worry about encoding when you convert from bytes to 
 text, and vice versa. Admittedly, the most common time you need to do 
 that is when reading input from files, but if all your text strings are 
 generated by Python, and not output anywhere, you shouldn't need to care 
 about encodings.

Why would you generate text strings and not output them anywhere?

The main advantage of using Unicode internally is that you can associate
encodings with the specific points where data needs to be converted
to/from bytes, rather than having to carry the encoding details around the
program.
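
As a rough sketch of that arrangement in Python 3: decode once where bytes come in, work on str in the middle, encode once where bytes go out. The function name upper_copy and the in-memory streams are hypothetical, chosen only for illustration.

```python
import io

def upper_copy(src, dst, src_encoding, dst_encoding):
    """Copy src to dst, upper-casing the text along the way."""
    text = src.read().decode(src_encoding)    # bytes -> str at the input boundary
    result = text.upper()                     # pure text work, no encoding involved
    dst.write(result.encode(dst_encoding))    # str -> bytes at the output boundary

# Stand-ins for real files: ISO-8859-1 in, UTF-8 out.
src = io.BytesIO("na\u00efve".encode("iso-8859-1"))
dst = io.BytesIO()
upper_copy(src, dst, "iso-8859-1", "utf-8")

print(dst.getvalue())
```

Only the first and last lines of the function mention an encoding; nothing in between has to carry it around.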



Re: (Simple?) Unicode Question

2009-08-29 Thread Steven D'Aprano
On Sat, 29 Aug 2009 20:09:12 +0100, Nobody wrote:

 On Sat, 29 Aug 2009 08:26:54 +, Steven D'Aprano wrote:
 
 Python only needs to know when you convert the text to or from bytes. I
 can do this:
 
 >>> s = "hello"
 >>> t = "world"
 >>> print(' '.join([s, t]))
 hello world
 
 and not need to care anything about encodings.
 
 So long as your terminal has a sensible encoding, and you have a good
 quality font, you should be able to print any string you can create.
 
 UTF-8 isn't a particularly sensible encoding for terminals.

Did I mention UTF-8?

Out of curiosity, why do you say that UTF-8 isn't sensible for terminals?


 And Unicode font is an oxymoron. You can merge a whole bunch of fonts
 together and stuff them into a TTF file; that doesn't make them a
 font, though.

I never mentioned Unicode font either. In any case, there's no reason 
why a skillful designer can't make a single font which covers the entire 
Unicode range in a consistent style.


 I may be wrong, but I believe that's part of the idea behind the
 separation of string and bytes types in Python 3.x. I believe, if you
 are using Python 3.x, you don't need the character encoding mumbo
 jumbo at all ;-)
 
 Nothing has changed in that regard. You still need to decode and
 encode text and for that you have to know the encoding.
 
 You only need to worry about encoding when you convert from bytes to
 text, and vice versa. Admittedly, the most common time you need to do
 that is when reading input from files, but if all your text strings are
 generated by Python, and not output anywhere, you shouldn't need to
 care about encodings.
 
 Why would you generate text strings and not output them anywhere?

Who knows? It doesn't matter -- the point is that you can if you want to. 
You only need to worry about encodings at input and output, therefore 
logically if you don't do I/O you can process strings all day long and 
never worry about encodings at all.


 The main advantage of using Unicode internally is that you can associate
 encodings with the specific points where data needs to be converted
 to/from bytes, rather than having to carry the encoding details around
 the program.

Surely the main advantage of Unicode is that it gives you a full and 
consistent range of characters not limited to the 128 characters provided 
by ASCII?



-- 
Steven


Re: (Simple?) Unicode Question

2009-08-27 Thread Rami Chowdhury

Further, does anything, except a printing device need to know the
encoding of a piece of text?


I may be wrong, but I believe that's part of the idea behind the separation
of string and bytes types in Python 3.x. I believe, if you are using  
Python 3.x, you don't need the character encoding mumbo jumbo at all ;-)


If you're using Python 2.x, though, I believe if you simply set the file  
opening mode to binary then data you read() should still be treated as an  
array of bytes, although you may encounter issues trying to access the  
n'th character.
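
To illustrate the n'th-character issue, here is a small sketch written in Python 3 syntax, where the bytes type plays the role that binary-mode data plays in 2.x. The sample word is arbitrary.

```python
data = "na\u00efve".encode("utf-8")   # b'na\xc3\xafve': 6 bytes for 5 characters
text = data.decode("utf-8")

third_byte = data[2:3]    # b'\xc3', only the first half of the third character
third_char = text[2]      # the whole third character

print(len(data), len(text), third_byte, third_char == "\u00ef")
```

Indexing the byte array counts bytes, so positions only line up with characters while the text stays within single-byte characters.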


Please do correct me if I'm wrong, anyone.

On Thu, 27 Aug 2009 09:39:06 -0700, Shashank Singh  
shashank.sunny.si...@gmail.com wrote:



Hi All!

I have a very simple (and probably stupid) question eluding me.
When exactly is the char-set information needed?

To make my question clear consider reading a file.
While reading a file, all I get is basically an array of bytes.

Now suppose a file has 10 bytes in it (all is data, no metadata,
forget the BOM and stuff for a little while). I read it into an array of
10 bytes, replace, say, the 2nd byte and write all the bytes back to a new
file.

Do I need the character encoding mumbo jumbo anywhere in this?

Further, does anything, except a printing device need to know the
encoding of a piece of text? I mean, as long as we are not trying
to get a symbolic representation of a text or get ith character
of it, all we need to do is to carry the intended encoding as
an auxiliary information to the data stored as byte array.

Right?

--shashank




--
Rami Chowdhury
Never attribute to malice that which can be attributed to stupidity --  
Hanlon's Razor



Re: (Simple?) Unicode Question

2009-08-27 Thread Albert Hopkins
On Thu, 2009-08-27 at 22:09 +0530, Shashank Singh wrote:
 Hi All!
 
 I have a very simple (and probably stupid) question eluding me.
 When exactly is the char-set information needed?
 
 To make my question clear consider reading a file.
 While reading a file, all I get is basically an array of bytes.
 
 Now suppose a file has 10 bytes in it (all is data, no metadata,
 forget the BOM and stuff for a little while). I read it into an array
 of 10 bytes, replace, say, the 2nd byte and write all the bytes back
 to a new file.
 
 Do I need the character encoding mumbo jumbo anywhere in this?
 
 Further, does anything, except a printing device need to know the
 encoding of a piece of text? I mean, as long as we are not trying
 to get a symbolic representation of a text or get ith character
 of it, all we need to do is to carry the intended encoding as
 an auxiliary information to the data stored as byte array.

If you are just reading and writing bytes then you are just reading and
writing bytes.  Where you need to worry about unicode, etc. is when you
start treating a series of bytes as TEXT (e.g. how many *characters* are
in this byte array).* 

This is no different, IMO, from treating an image file as a byte stream.
You don't need to worry about resolution, palette, bit-depth, etc. if
you are only treating it as a stream of bytes.  The only difference between
the two is that in Python unicode is a built-in type and image
isn't ;)

* Just make sure that if you are manipulating byte streams independently
of their textual representation, you open files in, e.g., binary
mode.
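
For the 10-byte example in the question, a minimal Python 3 sketch could look like the following. The file names and the replacement value 0xFF are made up for illustration, and a temporary directory stands in for wherever the real files live.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, "in.bin")
    dst_path = os.path.join(tmp, "out.bin")

    # Create a 10-byte input file to work on.
    with open(src_path, "wb") as f:
        f.write(bytes(range(10)))

    # Read in binary mode: the result is bytes, with no decoding applied.
    with open(src_path, "rb") as f:
        data = bytearray(f.read())

    data[1] = 0xFF  # replace the 2nd byte

    # Write in binary mode: the bytes go out exactly as they are.
    with open(dst_path, "wb") as f:
        f.write(data)

    with open(dst_path, "rb") as f:
        result = f.read()

print(result)
```

No encoding is named anywhere, because nothing here treats the bytes as text.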

-a

