Re: (Simple?) Unicode Question
On Sun, 30 Aug 2009 02:36:49 +, Steven D'Aprano wrote: So long as your terminal has a sensible encoding, and you have a good quality font, you should be able to print any string you can create. UTF-8 isn't a particularly sensible encoding for terminals. Did I mention UTF-8? Out of curiosity, why do you say that UTF-8 isn't sensible for terminals? I don't think I've ever seen a terminal (whether an emulator running on a PC or a hardware terminal) which supports anything like the entire Unicode repertoire, along with right-to-left writing, complex scripts, etc. Even support for double-width characters is uncommon. If your terminal can't handle anything outside of ISO-8859-1, there isn't any advantage to using UTF-8, and some disadvantages; e.g. a typical Unix tty driver will delete the last *byte* from the input buffer when you press backspace (Linux 2.6.* has the IUTF8 flag, but this is non-standard). Historically, terminal I/O has tended to revolve around unibyte encodings, with everything except the endpoints being encoding-agnostic. Anything which falls outside of that is a dog's breakfast; it's no coincidence that the word for messed-up text (arising from an encoding mismatch) was borrowed from Japanese (mojibake). Life is simpler if you can use a unibyte encoding. Apart from anything else, the failure modes tend to be harmless. E.g. you get the wrong glyph rather than two glyphs where you expected one. On a 7-bit channel, you get the wrong printable character rather than a control character (this is why ISO-8859-* reserves \x80-\x9F as control codes rather than using them as printable characters). And Unicode font is an oxymoron. You can merge a whole bunch of fonts together and stuff them into a TTF file; that doesn't make them a font, though. I never mentioned Unicode font either. In any case, there's no reason why a skillful designer can't make a single font which covers the entire Unicode range in a consistent style. Consistency between unrelated scripts is neither realistic nor desirable. E.g. Latin fonts tend to use uniform stroke widths unless they're specifically designed to look like handwriting, whereas Han fonts tend to prefer variable-width strokes which reflect the direction. The main advantage of using Unicode internally is that you can associate encodings with the specific points where data needs to be converted to/from bytes, rather than having to carry the encoding details around the program. Surely the main advantage of Unicode is that it gives you a full and consistent range of characters not limited to the 128 characters provided by ASCII? Nothing stops you from using other encodings, or from using multiple encodings. But using multiple encodings means keeping track of the encodings. This isn't impossible, and it may produce better results (e.g. no information loss from Han unification), but it can be a lot more work. -- http://mail.python.org/mailman/listinfo/python-list
Re: (Simple?) Unicode Question
* Rami Chowdhury (Thu, 27 Aug 2009 09:44:41 -0700) Further, does anything, except a printing device need to know the encoding of a piece of text? Python needs to know if you are processing the text. I may be wrong, but I believe that's part of the idea between separation of string and bytes types in Python 3.x. I believe, if you are using Python 3.x, you don't need the character encoding mumbo jumbo at all ;-) Nothing has changed in that regard. You still need to decode and encode text and for that you have to know the encoding. Thorsten -- http://mail.python.org/mailman/listinfo/python-list
Re: (Simple?) Unicode Question
On Sat, 29 Aug 2009 09:34:43 +0200, Thorsten Kampe wrote: * Rami Chowdhury (Thu, 27 Aug 2009 09:44:41 -0700) Further, does anything, except a printing device need to know the encoding of a piece of text? Python needs to know if you are processing the text. Python only needs to know when you convert the text to or from bytes. I can do this: s = hello t = world print(' '.join([s, t])) hello world and not need to care anything about encodings. So long as your terminal has a sensible encoding, and you have a good quality font, you should be able to print any string you can create. I may be wrong, but I believe that's part of the idea between separation of string and bytes types in Python 3.x. I believe, if you are using Python 3.x, you don't need the character encoding mumbo jumbo at all ;-) Nothing has changed in that regard. You still need to decode and encode text and for that you have to know the encoding. You only need to worry about encoding when you convert from bytes to text, and visa versa. Admittedly, the most common time you need to do that is when reading input from files, but if all your text strings are generated by Python, and not output anywhere, you shouldn't need to care about encodings. If all your text contains nothing but ASCII characters, you should never need to worry about encodings at all. -- Steven -- http://mail.python.org/mailman/listinfo/python-list
Re: (Simple?) Unicode Question
On Sat, 29 Aug 2009 08:26:54 +, Steven D'Aprano wrote: Python only needs to know when you convert the text to or from bytes. I can do this: s = hello t = world print(' '.join([s, t])) hello world and not need to care anything about encodings. So long as your terminal has a sensible encoding, and you have a good quality font, you should be able to print any string you can create. UTF-8 isn't a particularly sensible encoding for terminals. And Unicode font is an oxymoron. You can merge a whole bunch of fonts together and stuff them into a TTF file; that doesn't make them a font, though. I may be wrong, but I believe that's part of the idea between separation of string and bytes types in Python 3.x. I believe, if you are using Python 3.x, you don't need the character encoding mumbo jumbo at all ;-) Nothing has changed in that regard. You still need to decode and encode text and for that you have to know the encoding. You only need to worry about encoding when you convert from bytes to text, and visa versa. Admittedly, the most common time you need to do that is when reading input from files, but if all your text strings are generated by Python, and not output anywhere, you shouldn't need to care about encodings. Why would you generate text strings and not output them anywhere? The main advantage of using Unicode internally is that you can associate encodings with the specific points where data needs to be converted to/from bytes, rather than having to carry the encoding details around the program. -- http://mail.python.org/mailman/listinfo/python-list
Re: (Simple?) Unicode Question
On Sat, 29 Aug 2009 20:09:12 +0100, Nobody wrote: On Sat, 29 Aug 2009 08:26:54 +, Steven D'Aprano wrote: Python only needs to know when you convert the text to or from bytes. I can do this: s = hello t = world print(' '.join([s, t])) hello world and not need to care anything about encodings. So long as your terminal has a sensible encoding, and you have a good quality font, you should be able to print any string you can create. UTF-8 isn't a particularly sensible encoding for terminals. Did I mention UTF-8? Out of curiosity, why do you say that UTF-8 isn't sensible for terminals? And Unicode font is an oxymoron. You can merge a whole bunch of fonts together and stuff them into a TTF file; that doesn't make them a font, though. I never mentioned Unicode font either. In any case, there's no reason why a skillful designer can't make a single font which covers the entire Unicode range in a consistent style. I may be wrong, but I believe that's part of the idea between separation of string and bytes types in Python 3.x. I believe, if you are using Python 3.x, you don't need the character encoding mumbo jumbo at all ;-) Nothing has changed in that regard. You still need to decode and encode text and for that you have to know the encoding. You only need to worry about encoding when you convert from bytes to text, and visa versa. Admittedly, the most common time you need to do that is when reading input from files, but if all your text strings are generated by Python, and not output anywhere, you shouldn't need to care about encodings. Why would you generate text strings and not output them anywhere? Who knows? It doesn't matter -- the point is that you can if you want to. You only need to worry about encodings at input and output, therefore logically if you don't do I/O you can process strings all day long and never worry about encodings at all. The main advantage of using Unicode internally is that you can associate encodings with the specific points where data needs to be converted to/from bytes, rather than having to carry the encoding details around the program. Surely the main advantage of Unicode is that it gives you a full and consistent range of characters not limited to the 128 characters provided by ASCII? -- Steven -- http://mail.python.org/mailman/listinfo/python-list
Re: (Simple?) Unicode Question
Further, does anything, except a printing device need to know the encoding of a piece of text? I may be wrong, but I believe that's part of the idea between separation of string and bytes types in Python 3.x. I believe, if you are using Python 3.x, you don't need the character encoding mumbo jumbo at all ;-) If you're using Python 2.x, though, I believe if you simply set the file opening mode to binary then data you read() should still be treated as an array of bytes, although you may encounter issues trying to access the n'th character. Please do correct me if I'm wrong, anyone. On Thu, 27 Aug 2009 09:39:06 -0700, Shashank Singh shashank.sunny.si...@gmail.com wrote: Hi All! I have a very simple (and probably stupid) question eluding me. When exactly is the char-set information needed? To make my question clear consider reading a file. While reading a file, all I get is basically an array of bytes. Now suppose a file has 10 bytes in it (all is data, no metadata, forget the BOM and stuff for a little while). I read it into an array of 10 bytes, replace, say, 2nd bytes and write all the bytes back to a new file. Do i need the character encoding mumbo jumbo anywhere in this? Further, does anything, except a printing device need to know the encoding of a piece of text? I mean, as long as we are not trying to get a symbolic representation of a text or get ith character of it, all we need to do is to carry the intended encoding as an auxiliary information to the data stored as byte array. Right? --shashank -- Rami Chowdhury Never attribute to malice that which can be attributed to stupidity -- Hanlon's Razor 408-597-7068 (US) / 07875-841-046 (UK) / 0189-245544 (BD) -- http://mail.python.org/mailman/listinfo/python-list
Re: (Simple?) Unicode Question
On Thu, 2009-08-27 at 22:09 +0530, Shashank Singh wrote: Hi All! I have a very simple (and probably stupid) question eluding me. When exactly is the char-set information needed? To make my question clear consider reading a file. While reading a file, all I get is basically an array of bytes. Now suppose a file has 10 bytes in it (all is data, no metadata, forget the BOM and stuff for a little while). I read it into an array of 10 bytes, replace, say, 2nd bytes and write all the bytes back to a new file. Do i need the character encoding mumbo jumbo anywhere in this? Further, does anything, except a printing device need to know the encoding of a piece of text? I mean, as long as we are not trying to get a symbolic representation of a text or get ith character of it, all we need to do is to carry the intended encoding as an auxiliary information to the data stored as byte array. If you are just reading and writing bytes then you are just reading and writing bytes. Where you need to worry about unicode, etc. is when you start treating a series of bytes as TEXT (e.g. how many *characters* are in this byte array).* This is no different, IMO, than treating a byte stream vs a image file. You don't, need to worry about resolution, palette, bit-depth, etc. if you are only treating as a stream of bytes. The only difference between the two is that in Python unicode is a built-in type and image isn't ;) * Just make sure that if you are manipulating byte streams independent of it's textual representation that you open files, e.g., in binary mode. -a -- http://mail.python.org/mailman/listinfo/python-list