Re: Please do not use en_US.UTF-8 outside the US
Glenn Maynard wrote:

What's being suggested is that locales be generated per-region/language; e.g. tell the system to generate tr_TR, and then be able to use all relevant encodings (ISO-8859-9 and UTF-8 and whatever else is convertible). Case mappings, collation rules, translation text and so on can be stored in Unicode and converted at runtime, probably still caching common encodings for speed.

Seems like a nice, but naive, idea. If such a simple, generic solution were possible, I'd imagine it would have been done already.

Windows NT did that in 1993. Exactly what you describe. Sorry.

Antoine

--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/
Re: Please do not use en_US.UTF-8 outside the US
On Sun, Oct 20, 2002 at 12:06:32AM +0200, Antoine Leca wrote:

What's being suggested is that locales be generated per-region/language; e.g. tell the system to generate tr_TR, and then be able to use all relevant encodings (ISO-8859-9 and UTF-8 and whatever else is convertible). Case mappings, collation rules, translation text and so on can be stored in Unicode and converted at runtime, probably still caching common encodings for speed.

Seems like a nice, but naive, idea. If such a simple, generic solution were possible, I'd imagine it would have been done already.

Windows NT did that in 1993. Exactly what you describe. Sorry.

Sorry? I don't even see how this is relevant. NT and POSIX i18n are completely different, so just because NT can do it doesn't mean it's practical here. If you have a point, please say it; I can't even tell whether you agree with the idea (which is not my own) or not.

-- Glenn Maynard
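The idea discussed in this exchange (keep case mappings in Unicode, convert at runtime, cache per encoding) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how glibc or Windows NT actually do it: the name upper_table is made up for this sketch, it handles single-byte encodings only, and it uses Python's built-in, locale-independent str.upper() as the Unicode case mapping.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def upper_table(encoding: str) -> bytes:
    """Build (and cache) a 256-entry uppercase table for a single-byte encoding.

    Hypothetical helper: every byte is decoded to Unicode, uppercased there,
    and encoded back. Bytes whose uppercase form does not fit in one byte of
    the same encoding are left unchanged.
    """
    table = bytearray(range(256))
    for byte in range(256):
        try:
            char = bytes([byte]).decode(encoding)
        except UnicodeDecodeError:
            continue  # byte is not a complete character in this encoding
        try:
            encoded = char.upper().encode(encoding)
        except UnicodeEncodeError:
            continue  # uppercase form does not exist in this encoding
        if len(encoded) == 1:
            table[byte] = encoded[0]
    return bytes(table)


# The same Unicode rule serves two different legacy encodings.
latin1_upper = b"m\xfcde".translate(upper_table("iso-8859-1"))
latin5_upper = "\u0131\u011f".encode("iso-8859-9").translate(upper_table("iso-8859-9"))
```

The table is computed once per encoding and reused afterwards, which is the "caching common encodings for speed" part. Language-specific rules, such as the Turkish dotted and dotless i, would still need per-locale data on top of this.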
Re: Please do not use en_US.UTF-8 outside the US
On 1 May 2002, H. Peter Anvin wrote:

In the former case, I would like to propose a worldwide compromise page size -- 210 x 279 mm. Such a page can be printed, cleanly, on either A4 (210 x 297 mm) or US-letter (216 x 279 mm) by expanding either the horizontal (US-letter) or vertical (A4) margin.

G. Do we need to rehash the whole !$#%#$%!# argument?
Re: Please do not use en_US.UTF-8 outside the US
On Fri, 18 Oct 2002 [EMAIL PROTECTED] wrote:

On 1 May 2002, H. Peter Anvin wrote: In the former case, I would like to propose a worldwide compromise page size -- 210 x 279 mm. Such a page can be printed, cleanly, on either A4 (210 x 297 mm) or US-letter (216 x 279 mm) by expanding either the horizontal (US-letter) or vertical (A4) margin.

G. Do we need to rehash the whole !@$#@%#$%!@# argument?

Indeed no, I went too far back when Thomas restarted this thread; now let's shuddup and hope nobody noticed. :-)

--
Rob. (Robert de Bath robert$ @ debath.co.uk)
http://www.cix.co.uk/~mayday
Google Homepage: http://www.google.com/search?btnIq=Robert+de+Bath
Re: Please do not use en_US.UTF-8 outside the US
Thanks to the Sun engineer for responding to the problem here.

keld wrote: ISO 15897 also has some fallback rules. I think that could be extended in some way, so that you may specify more locales to choose from, like it is done with accept-language: in http. I think some software already does this. Current glibc supports ISO 15897, but that support is going to be removed, as far as I know.

?? This is again just stupid.

Ienup Sung [EMAIL PROTECTED] wrote: I just would like to point out that we started with en_US.UTF-8 and ko.UTF-8 at Solaris 2.6 back in 1996 or so. Since then, we've been gradually and also consistently increasing the number of Unicode/UTF-8 locales and that's our goal, i.e., to try to supply as many Unicode/UTF-8 locales as our (limited) resources allow. Also, as the locale name specifies, en_US.UTF-8 is a locale for American English in the States. We have never even tried to persuade anyone to use the locale as the only solution; we are also quite surprised that people have seen it that way. As additional evidence, in Solaris 9, we have: ... [lots of locales]

My point is actually that it is the wrong strategy to handle this by increasing the number of UTF-8 locales. There should basically be no such thing as a UTF-8-bound locale. The convention of using LC_CTYPE to specify both locale and encoding is OK if these are handled separately. A generic solution is required, as Keld also argued.

Ienup Sung [EMAIL PROTECTED] continued: Regarding the different number of locales for the same Solaris release systems, the reason is that when you install/upgrade your system, you might have chosen only those locales. I.e., probably your system admin or jump start installation specified or selected them during the installation/upgrade or during the preparation of the jump start installation script. -- This is because not everyone wants to have all the locales that we have to offer, and so we show what kind of locales are available during the installation/upgrade so that they can be selected as needed. One can, by the way, always add locales to an existing system after the installation, and one way is specified in the following web page: http://www.sun.com/developers/gadc/faq/sol8.html

Sorry, this is not true. One cannot do it; only the system administrator can. Please also consider the following response:

From: Glenn Maynard [EMAIL PROTECTED] Admins, with no personal interest in UTF-8 and few users using it, are likely to only generate legacy locales, and not enable UTF-8 ones. This probably isn't any particular desire *not* to have it; they just don't know the difference (and shouldn't need to). So, even though my system and terminal are UTF-8, and all of the systems I connect to are *capable* of it, only a few actually have the locale available. This is a senseless hurdle to using UTF-8; I have to nag admins to generate UTF-8 locales, even though all of the software I'm using has already been updated to handle it! Long before UTF-8 can ever be the default encoding everywhere, it needs to be *available* everywhere (without root intervention). This is a problem on Debian, at least. It shows a list of locale names; you only get UTF-8 if you ask for it. It should probably show a list of country/language codes; e.g. choosing en_US should generate both en_US (ISO-8859-1) and en_US.UTF-8, unless the user specifically asks for UTF-8 to not be generated.

Ienup Sung [EMAIL PROTECTED] continued: Actually, we ship all our locales in a single product, and so if you have Solaris 8 or later, all the locales are on the Solaris Software 1 of 2 CD. (Translated message files and some locale-specific files and applications for French, Italian, German, Spanish, Swedish, Simplified Chinese, Traditional Chinese, Japanese and Korean are on the Languages CD, by the way.) We couldn't do that before S8 simply because there were licensing issues on fonts and some input methods that we couldn't resolve until the S8 timeframe, which took a lot of money and time from us.

And Glenn Maynard [EMAIL PROTECTED] continued: I suppose one reason this isn't done is because locale generation does take quite a while (maybe 20 seconds per locale on my system). There are probably other, less obvious reasons this isn't done, but I don't know them. One such might be http://bugs.debian.org/99623 ; but that doesn't seem to prevent generating UTF-8 most of the time.

Once again, the approach of having to install/generate/whatever locales to support UTF-8 is fundamentally wrong. Just separate locale and encoding recognition (using LC_CTYPE as a common base) and the problem will vanish; that's the only reasonable solution. It needs to be fixed!

Thomas Wolff
Re: Please do not use en_US.UTF-8 outside the US
On Thu, 17 Oct 2002 [EMAIL PROTECTED] wrote:

It would be yet simpler to eliminate all non-utf-8 locales.

This is what RedHat 8.0 does, except for CJK, for which legacy encodings are still used. (Well, for zh_CN, GB 18030 is used, which is just another UTF in a sense.) The exclusion of CJK from the switch-over to UTF-8 is very unfortunate (I've been using ko_KR.UTF-8 for over half a year and I really like it) and I hope it'll change soon (see https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=75829). As I wrote many times before, Korean desperately needs UTF-8, and that's why ko_KR.UTF-8 was among the very few UTF-8 locales offered for Solaris and AIX in the mid-1990s (see Ienup's message).

It would be simpler, but since the vast majority of the world is still using legacy locales, it's irrelevant. Come back in 5-10 years, maybe; I'm talking about things that can be done today.

They (legacy encodings) could still be available, but they would not be the default. When you set up a new machine, it's not front-loaded with scads of text file docs you care about; you will add things as you go. If you receive new messages (email, documents, etc.) they would all be converted to something you can read normally. All you care about is that it is well integrated and it works.

I totally agree with you.

Jungshik
Re: Please do not use en_US.UTF-8 outside the US
On Fri, Oct 18, 2002 at 11:26:08AM +0200, Thomas Wolff wrote:

Thanks that a Sun engineer responds to the problem here.

keld wrote: ISO 15897 also has some fallback rules. I think that could be extended in some way, so that you may specify more locales to choose from, like it is done with accept-language: in http. I think some software already does this. Current glibc supports ISO 15897, but that support is going to be removed, as far as I know.

?? This is again just stupid.

I am not sure what you mean is stupid. I would like to see locale support in a generic way, just as you described it, with the tables stored in 10646 and then the individual charmaps applied. I am not sure how to do this in an efficient way, though.

Kind regards
Keld
Re: Please do not use en_US.UTF-8 outside the US
Keld wrote:

On Fri, Oct 18, 2002 at 11:26:08AM +0200, Thomas Wolff wrote:

keld wrote: ISO 15897 also has some fallback rules. I think that could be extended in some way, so that you may specify more locales to choose from, like it is done with accept-language: in http. I think some software already does this. Current glibc supports ISO 15897, but that support is going to be removed, as far as I know.

?? This is again just stupid.

I am not sure what you mean is stupid.

Sorry, of course I meant it's stupid if glibc is going to remove support for generic handling.

I would like to see locale support in a generic way, just as you described it, with the tables stored in 10646 and then the individual charmaps applied. I am not sure how to do this in an efficient way, though.

Why should it be inefficient to separate encoding tables from locale handling?

Kind regards,
Thomas
Re: Please do not use en_US.UTF-8 outside the US
I also think the current POSIX global/single locale model is limiting and we do need to have MT-safe, multi-locale APIs. I know this may not be agreed upon by everyone, but I believe we need some form of locale, at least one, for each and every region/country, since it is not really possible at the moment to have a single, unified and universal set of cultural convention and language/writing system data in a cost-effective and manageable manner, even in the LC_CTYPE category data. By that reasoning, having many Unicode/UTF-8 locales is not such a bad idea at all in my opinion (well, at least at the locale instance level, not the locale definition source level). (In doing so, one could also use a template for rather rapid population of Unicode locales, sharing as many common locale definitions as possible.)

Yes, you're correct. What I meant by "One can always add any locales after the Solaris installation" in my previous email was that people can add their locales by being root or by asking someone who can be root (i.e., the sys admin in many cases). Obviously, for security reasons alone, we wouldn't be able to allow locale addition/removal by anyone and everyone, even though in my opinion it would be possible by providing a utility that does setuid to root before the locale installation, or something similar.

With regards,
Ienup

PS. One possible example of why it's difficult to have a single, unified LC_CTYPE is the following. In Turkish, in my understanding, the (simple) case conversion goes like this:

    From case                   To case
    I (U+0049)                  dotless i (U+0131)
    i (U+0069)                  I with dot above (U+0130)
    I with dot above (U+0130)   i (U+0069)
    dotless i (U+0131)          I (U+0049)

But in other languages, it usually goes like this:

    From case                   To case
    I (U+0049)                  i (U+0069)
    i (U+0069)                  I (U+0049)
    I with dot above (U+0130)   i (U+0069)
    dotless i (U+0131)          I (U+0049)

for obvious reasons (and also due to some limitations we have in POSIX).
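The two case-conversion tables in the PS can be expressed as a small Python sketch. It assumes the language is passed in as a plain string ("tr" for Turkish); the function names and the override dictionaries are made up for illustration, and Python's own str.upper()/str.lower() stand in for the language-neutral default mapping.

```python
# Turkish-specific overrides, taken from the first table in the PS.
TURKISH_UPPER = {"i": "\u0130", "\u0131": "I"}   # i -> I with dot above, dotless i -> I
TURKISH_LOWER = {"I": "\u0131", "\u0130": "i"}   # I -> dotless i, I with dot above -> i


def to_upper(text: str, language: str) -> str:
    """Uppercase text, applying the Turkish overrides first when asked to."""
    if language == "tr":
        text = "".join(TURKISH_UPPER.get(ch, ch) for ch in text)
    return text.upper()


def to_lower(text: str, language: str) -> str:
    """Lowercase text, applying the Turkish overrides first when asked to."""
    if language == "tr":
        text = "".join(TURKISH_LOWER.get(ch, ch) for ch in text)
    return text.lower()


turkish = to_upper("istanbul", "tr")   # dotted capital I at the start
english = to_upper("istanbul", "en")   # plain ASCII capital I
```

The same input string gives two different results depending on the language, which is why the mapping cannot live in one language-neutral LC_CTYPE table. Everything else in the mapping can still be shared; only the handful of exceptions is per-language.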
Re: Please do not use en_US.UTF-8 outside the US
[EMAIL PROTECTED] wrote on 2002-10-16 14:48 UTC:

I came across this older mail by Markus: General warning: Please do not use the locale name en_US.UTF-8 anywhere outside North America. Some older Solaris documentation suggested that this is the only UTF-8 locale you'll ever need, as locales don't change much sensible beyond the encoding anyway. This is not the case any more today!

The problem is that on many Sun installations, en_US.UTF-8 is the only UTF-8 locale available at all!

I can't reproduce this problem report on our current Suns:

    $ uname -a ; locale -a | grep UTF-8
    SunOS piper 5.8 Generic_108528-12 sun4u sparc SUNW,Ultra-4
    en_US.UTF-8
    fr.UTF-8
    fr_FR.UTF-8
    fr_FR.UTF-8@euro
    de.UTF-8
    es.UTF-8
    it.UTF-8
    ja_JP.UTF-8
    ko.UTF-8
    sv.UTF-8
    zh.UTF-8
    zh_TW.UTF-8

It is slightly unpleasant that there is no Commonwealth en.UTF-8 or British en_GB.UTF-8, but as long as you use en_US only in LC_CTYPE and not in LANG, you are usually fairly safe from the terror of US cultural conventions.

A decent solution to this problem would be to handle basic locale information (en_US) and encoding suffix (UTF-8) separately and specify that ANY available locale can be suffixed with ANY known encoding, so installed de, gb, whatever locales could always be run with UTF-8. Is anything specified anywhere about this?

http://www.opengroup.org/onlinepubs/007904975/functions/setlocale.html

In principle, you could set

    LANG=de LC_CTYPE=en_US.UTF-8

However, in practice, if de is for ISO 8859-1, then it will contain only collating data for ISO 8859-1 and will therefore not work as well as if you had taken the collating data from a full UTF-8 locale that comes with all the necessary data. Therefore, in practice, the locales that you mix with LC_* should preferably come with identical encodings.

Markus

--
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org, WWW: http://www.cl.cam.ac.uk/~mgk25/
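The advice in this message (mix LC_* settings only when their encodings agree) depends on being able to take a locale name apart. A minimal Python sketch follows, assuming the usual language[_territory][.codeset][@modifier] naming seen in the locale -a output in this thread; LocaleName, parse_locale and encodings_agree are hypothetical names for this illustration, not an existing API.

```python
from typing import NamedTuple, Optional


class LocaleName(NamedTuple):
    language: str
    territory: Optional[str]
    codeset: Optional[str]
    modifier: Optional[str]


def parse_locale(name: str) -> LocaleName:
    """Split a name such as 'fr_FR.UTF-8@euro' into its four parts."""
    rest, _, modifier = name.partition("@")
    rest, _, codeset = rest.partition(".")
    language, _, territory = rest.partition("_")
    return LocaleName(language, territory or None, codeset or None, modifier or None)


def encodings_agree(a: str, b: str) -> Optional[bool]:
    """True/False when both names carry a codeset, None when one leaves it implicit."""
    ca, cb = parse_locale(a).codeset, parse_locale(b).codeset
    if ca is None or cb is None:
        return None  # e.g. plain 'de': the encoding is whatever the system chose
    return ca.lower() == cb.lower()


mixed = encodings_agree("de", "en_US.UTF-8")             # None: 'de' does not say
matched = encodings_agree("sv_SE.UTF-8", "en_US.UTF-8")  # True
```

The None case is exactly the LANG=de LC_CTYPE=en_US.UTF-8 situation: the name alone does not reveal that de is an ISO 8859-1 locale, so a tool would have to look that up on the system in question.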
Re: Please do not use en_US.UTF-8 outside the US
On Wed, Oct 16, 2002 at 04:48:15PM +0200, [EMAIL PROTECTED] wrote:

I came across this older mail by Markus: General warning: Please do not use the locale name en_US.UTF-8 anywhere outside North America. Some older Solaris documentation suggested that this is the only UTF-8 locale you'll ever need, as locales don't change much sensible beyond the encoding anyway. This is not the case any more today!

The problem is that on many Sun installations, en_US.UTF-8 is the only UTF-8 locale available at all! A decent solution to this problem would be to handle basic locale information (en_US) and encoding suffix (UTF-8) separately and specify that ANY available locale can be suffixed with ANY known encoding, so installed de, gb, whatever locales could always be run with UTF-8. Is anything specified anywhere about this? Perhaps someone might nag Sun to fix this broken thing.

Yes, this is actually what is specified in ISO/IEC 15897, which makes rules on how to name POSIX locales, amongst other things. You can find it on the WG20 projects page, N610 I believe.

Kind regards
Keld
Re: Please do not use en_US.UTF-8 outside the US
On Thu, 17 Oct 2002, Thomas Wolff wrote:

    wolff@fscce14:~ uname -a ; locale -a | grep UTF-8
    SunOS fscce14 5.8 Generic_108528-12 sun4us sparc FJSV,GPUSK
    en_US.UTF-8
    sv.UTF-8
    sv_SE.UTF-8
    sv_SE.UTF-8@euro

In principle, you could set LANG=de LC_CTYPE=en_US.UTF-8

OK, I get:

    wolff@fscce14:~ LANG=de LC_CTYPE=en_US.UTF-8 /bin/sh
    couldn't set locale correctly
    couldn't set locale correctly

That's probably because you don't have the 'de' locale installed. Have you tried 'LANG=sv_SE.UTF-8', if Swedish is all right with you? If that's the case, you don't have to set LC_CTYPE to en_US.UTF-8. Or, you can unset LANG and set other LC_* as you wish:

    LC_CTYPE=en_US.UTF-8 or sv_SE.UTF-8 (character classification, collation and so forth would behave differently)
    LC_MESSAGES=C (if just plain English is better for you than localized messages)
    LC_TIME=C (again, if you just want plain old Unix/POSIX behavior)

I want an LC_* setting that tells my applications to use UTF-8 and doesn't affect the system inappropriately otherwise, and that works with SunOS and doesn't let /bin/sh choke!

I don't know why Sun doesn't ship its Solaris with all the locales supported by Solaris. Perhaps a marketing ploy :-) DEC (now Compaq, and should it be HP by now?) Digital Unix 4.x (now Tru64) came with all the locales on the OS CD-ROM. It's up to the system administrator which locales are installed.

Jungshik Shin
Re: Please do not use en_US.UTF-8 outside the US
On Thu, Oct 17, 2002 at 05:06:29PM +0200, Thomas Wolff wrote:

Markus Kuhn wrote: A decent solution to this problem would be to handle basic locale information (en_US) and encoding suffix (UTF-8) separately and specify that ANY available locale can be suffixed with ANY known encoding, so installed de, gb, whatever locales could always be run with UTF-8. Is anything specified anywhere about this? http://www.opengroup.org/onlinepubs/007904975/functions/setlocale.html

I think that the formulation "If the string does not correspond to a valid locale, setlocale() shall return a NULL pointer and the international environment is not changed." is as stupid as it could be, since it imposes an all-or-nothing locale matching strategy. I don't see why aspects that are handled independently should be tied together this way. Even more, one would expect decent fallback behaviour, e.g. mapping en_GB to en where en_GB is not available, etc. How can this be changed?

ISO 15897 also has some fallback rules. I think that could be extended in some way, so that you may specify more locales to choose from, like it is done with accept-language: in http. I think some software already does this. Current glibc supports ISO 15897, but that support is going to be removed, as far as I know.

Best regards
keld
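The two wishes in this message (fall back from en_GB to en, and accept a preference list as with accept-language) can be combined in a short Python sketch. The fallback order used here is an assumption made for illustration and is not taken from ISO 15897 or any setlocale() implementation; fallback_candidates and choose_locale are hypothetical names. The codeset is deliberately never changed during fallback.

```python
from typing import Iterable, List, Optional


def fallback_candidates(name: str) -> List[str]:
    """List progressively less specific names: drop the modifier, then the territory."""
    base, _, modifier = name.partition("@")
    base, _, codeset = base.partition(".")
    language, _, territory = base.partition("_")
    suffix = f".{codeset}" if codeset else ""
    stems = ([f"{language}_{territory}"] if territory else []) + [language]
    candidates = []
    for stem in stems:
        if modifier:
            candidates.append(f"{stem}{suffix}@{modifier}")
        candidates.append(f"{stem}{suffix}")
    return candidates


def choose_locale(preferences: Iterable[str], available: Iterable[str]) -> Optional[str]:
    """Return the first installed candidate, walking the preferences in order."""
    installed = set(available)
    for wanted in preferences:
        for candidate in fallback_candidates(wanted):
            if candidate in installed:
                return candidate
    return None  # nothing matched; the caller keeps its current locale


# UTF-8 locales installed on a host like the one reported earlier in the thread.
installed = ["en_US.UTF-8", "sv.UTF-8", "sv_SE.UTF-8", "sv_SE.UTF-8@euro"]
picked = choose_locale(["en_GB.UTF-8", "sv_SE.UTF-8"], installed)
```

With a list like this, a missing en_GB.UTF-8 no longer means total failure: the search moves on to en.UTF-8 and then to the next preference instead of returning nothing at the first miss.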
Re: Please do not use en_US.UTF-8 outside the US
On Thu, Oct 17, 2002 at 03:24:48PM +0100, Markus Kuhn [MK] wrote:

TW: The problem is that on many Sun installations, en_US.UTF-8 is the only UTF-8 locale available at all!

MK: I can't reproduce this problem report on our current Suns:

Unfortunately I am able to reproduce the problem on our new Sun Blade 100s:

    silbe@puppis:~$ uname -a ; locale -a | grep UTF-8
    SunOS puppis 5.8 Generic_108528-13 sun4u sparc
    en_US.UTF-8
    silbe@puppis:~$

CU/Lnx Sascha

Registered Linux User #77587 (http://counter.li.org/)
bomb terrorist afghanistan PGP encrypt CIA FBI BND MAD StaSi anschlag strike sex pussy xxx kill bj hitler
Re: Please do not use en_US.UTF-8 outside the US
I just would like to point out that we started with en_US.UTF-8 and ko.UTF-8 at Solaris 2.6 back in 1996 or so. Since then, we've been gradually and also consistently increasing the number of Unicode/UTF-8 locales and that's our goal, i.e., to try to supply as many Unicode/UTF-8 locales as our (limited) resources allow.

Also, as the locale name specifies, en_US.UTF-8 is a locale for American English in the States. We have never even tried to persuade anyone to use the locale as the only solution; we are also quite surprised that people have seen it that way.

As additional evidence, in Solaris 9, we have:

    system% uname -a
    SunOS aal 5.9 Generic sun4u sparc SUNW,Ultra-5_10
    system% locale -a | grep UTF-8 | sort -u
    ar_EG.UTF-8 de.UTF-8 de_DE.UTF-8 de_DE.UTF-8@euro en_US.UTF-8
    es.UTF-8 es_ES.UTF-8 es_ES.UTF-8@euro fi_FI.UTF-8
    fr.UTF-8 fr_BE.UTF-8 fr_BE.UTF-8@euro fr_FR.UTF-8 fr_FR.UTF-8@euro
    he_IL.UTF-8 hi_IN.UTF-8 it.UTF-8 it_IT.UTF-8 it_IT.UTF-8@euro
    ja_JP.UTF-8 ko.UTF-8 ko_KR.UTF-8 ko_KR.UTF-8@dict
    pl.UTF-8 pl_PL.UTF-8 pt_BR.UTF-8 ru.UTF-8 ru_RU.UTF-8
    sv.UTF-8 sv_SE.UTF-8 sv_SE.UTF-8@euro th_TH.UTF-8 tr_TR.UTF-8
    zh.UTF-8 zh_CN.UTF-8 zh_CN.UTF-8@pinyin zh_CN.UTF-8@radical zh_CN.UTF-8@stroke
    zh_HK.UTF-8 zh_HK.UTF-8@radical zh_HK.UTF-8@stroke
    zh_TW.UTF-8 zh_TW.UTF-8@pinyin zh_TW.UTF-8@radical zh_TW.UTF-8@stroke zh_TW.UTF-8@zhuyin

With regards,
Ienup
Re: Please do not use en_US.UTF-8 outside the US
I suppose one reason this isn't done is because locale generation does take quite a while (maybe 20 seconds per locale on my system). There are probably other, less obvious reasons this isn't done, but I don't know them. One such might be http://bugs.debian.org/99623 ; but that doesn't seem to prevent generating UTF-8 most of the time.

It would be yet simpler to eliminate all non-utf-8 locales. Or even take encoding out of locale altogether, and have locale specify things like collation, date formats, gettext strings, etc. (then encoding always is utf-8)
Re: Please do not use en_US.UTF-8 outside the US
On Thu, Oct 17, 2002 at 07:06:52PM -0400, [EMAIL PROTECTED] wrote: I suppose one reason this isn't done is because locale generation does take quite a while (maybe 20 seconds per locale on my system). There are probably other, less obvious reasons this isn't done, but I don't know them. One such might be http://bugs.debian.org/99623 ; but that doesn't seem to prevent generating UTF-8 most of the time. It would be yet simpler to eliminate all non-utf-8 locales. It would be simpler, but since the vast majority of the world is still using legacy locales, it's irrelevant. Come back in 5-10 years, maybe; I'm talking about things that can be done today. -- Glenn Maynard
Re: Please do not use en_US.UTF-8 outside the US
It would be yet simpler to eliminate all non-utf-8 locales.

It would be "simpler", but since the vast majority of the world is still using legacy locales, it's irrelevant. Come back in 5-10 years, maybe; I'm talking about things that can be done today.

They (legacy encodings) could still be available, but they would not be the default.

When you set up a new machine, it's not front-loaded with scads of text file docs you care about; you will add things as you go. If you receive new messages (email, documents, etc.) they would all be converted to something you can read normally. All you care about is that it is well integrated and it works.

I don't see why it's such a hangup.
Re: Please do not use en_US.UTF-8 outside the US
I came across this older mail by Markus:

General warning: Please do not use the locale name en_US.UTF-8 anywhere outside North America. Some older Solaris documentation suggested that this is the only UTF-8 locale you'll ever need, as locales don't change much sensible beyond the encoding anyway. This is not the case any more today!

The problem is that on many Sun installations, en_US.UTF-8 is the only UTF-8 locale available at all! A decent solution to this problem would be to handle basic locale information (en_US) and encoding suffix (UTF-8) separately and specify that ANY available locale can be suffixed with ANY known encoding, so installed de, gb, whatever locales could always be run with UTF-8. Is anything specified anywhere about this? Perhaps someone might nag Sun to fix this broken thing.

Thomas Wolff
Re: Please do not use en_US.UTF-8 outside the US
On Thu, May 02, 2002 at 11:31:03PM +1000, Roger So wrote:

On Thu, 2002-05-02 at 08:15, Keld Jørn Simonsen wrote: The nice thing about LC_PAPER is that it is set either on installation, or as part of the normal setup. I think most people know how to set the locale, while some, maybe many, would not know that there is an /etc/papersize file.

Of course, on Debian, there's no need to know about /etc/papersize -- everything is done through Debconf. If Debconf is set to use an interactive UI, a simple dpkg-reconfigure libpaperg will provide a list of paper sizes for users to choose. But users still need to know the package name for the libpaper package. Perhaps a configlet would help... Sorry for going way off topic here. ;)

You beat me here :) To those thinking that remembering the command is not any better than finding out which file to edit, it should be noted that 1) most people will just automatically get the dialog at install-time 2) GUIs are in the works so that the admin can easily reconfigure the parts of the system that he wants to.

--
Yann Dirson [EMAIL PROTECTED]    | Why make M$-Bill richer richer ?
Debian-related: [EMAIL PROTECTED] | Support Debian GNU/Linux:
Pro: [EMAIL PROTECTED]            | Freedom, Power, Stability, Gratuity
http://ydirson.free.fr/           | Check http://www.debian.org/
Re: Please do not use en_US.UTF-8 outside the US
On Tue, 30 Apr 2002 21:27:13 -0500 David Starner [EMAIL PROTECTED] wrote:

On Tue, Apr 30, 2002 at 09:30:46PM -0400, Jungshik Shin wrote: Debian should support LC_PAPER locale category instead of relying on /etc/papersize. psutil (psresize, psnup, etc) relies on it to pick the default paper size. If it's set to en_US.xxx, it uses letter *by default*. Otherwise, it uses A4 *by default*.

And Debian psutils uses /etc/papersize (actually libpaperg, a wrapper around that one line file), since that's the Debian way. The locale method doesn't make sense, as what size paper the printer has is not usually a user setting. Also, the locale system has two alternatives, letter or A4, ignoring the possibility that someone might have, say, legal size paper loaded.

If /etc/papersize is specific to Debian, how would developers consistently detect something like paper size? I think using LC_PAPER should be enough to satisfy most programs. The rest should be handled by an external program like qtcups. That can consult /etc/papersize in Debian or whatever the LSB is recommending.

Mike
-- May The Source be with you.
Re: Please do not use en_US.UTF-8 outside the US
Yann Dirson [EMAIL PROTECTED] writes:
> The problem here is satisfying users, not programs. Papersize is a
> setting that is specific to available printers, not to any locale one
> may use.

What should the default paper size be in an application that creates
PDF documents?

Or, for another example: What if one has both a letter tray and an A4
tray in the available printer?

/Lars
--
Lars Engebretsen, PhD, [EMAIL PROTECTED]
Re: Please do not use en_US.UTF-8 outside the US
On Wed, May 01, 2002 at 02:27:27PM +0200, Lars Engebretsen wrote:
> Yann Dirson [EMAIL PROTECTED] writes:
> > The problem here is satisfying users, not programs. Papersize is a
> > setting that is specific to available printers, not to any locale
> > one may use.
>
> What should the default paper size be in an application that creates
> PDF documents?

Maybe both /etc/papersize and LC_PAPER have their uses. But their
fields of application have to be clearly defined for programmers to do
the right thing...

> Or, for another example: What if one has both a letter tray and an A4
> tray in the available printer?

Not sure. One is probably the default printer anyway.

--
Yann Dirson [EMAIL PROTECTED]        | Why make M$-Bill richer & richer ?
Debian-related: [EMAIL PROTECTED]    | Support Debian GNU/Linux:
Pro: [EMAIL PROTECTED]               | Freedom, Power, Stability, Gratuity
http://ydirson.free.fr/              | Check http://www.debian.org/
Re: Please do not use en_US.UTF-8 outside the US
Followup to: [EMAIL PROTECTED]
By author: Lars Engebretsen [EMAIL PROTECTED]
In newsgroup: linux.utf8
> What should the default paper size be in an application that creates
> PDF documents?
>
> Or, for another example: What if one has both a letter tray and an A4
> tray in the available printer?

In the latter case you can use either -- most printers will pick the
appropriate tray depending on the input. Use whichever one is more
appropriate for your locale (letter in the U.S., A4 in Sweden, for
example.)

In the former case, I would like to propose a worldwide compromise page
size -- 210 x 279 mm. Such a page can be printed, cleanly, on either A4
(210 x 297 mm) or US-letter (216 x 279 mm) by expanding either the
horizontal (US-letter) or vertical (A4) margin.

-hpa
--
[EMAIL PROTECTED] at work, [EMAIL PROTECTED] in private!
Unix gives you enough rope to shoot yourself in the foot.
http://www.zytor.com/~hpa/puzzle.txt [EMAIL PROTECTED]
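For what it's worth, the arithmetic behind the 210 x 279 mm proposal above can be sketched in a few lines of C. This is only an illustration using the rounded millimetre figures quoted in the message; the struct and function names (page, common_page, extra_margin_mm) are made up for this sketch and are not from any library.

```c
#include <assert.h>
#include <stdio.h>

/* Page dimensions in millimetres (rounded figures from the message). */
struct page { double width_mm; double height_mm; };

/* Largest page that fits on both sheets: take the smaller width and
   the smaller height. */
struct page common_page(struct page a, struct page b)
{
    struct page out;
    out.width_mm  = a.width_mm  < b.width_mm  ? a.width_mm  : b.width_mm;
    out.height_mm = a.height_mm < b.height_mm ? a.height_mm : b.height_mm;
    return out;
}

/* Extra margin per edge when 'content_mm' is centred on 'sheet_mm'. */
double extra_margin_mm(double sheet_mm, double content_mm)
{
    assert(content_mm <= sheet_mm);
    return (sheet_mm - content_mm) / 2.0;
}

/* Print how much each margin grows when 'content' is centred on 'sheet'. */
void print_fit(const char *name, struct page sheet, struct page content)
{
    printf("%s: +%.1f mm left/right, +%.1f mm top/bottom\n", name,
           extra_margin_mm(sheet.width_mm, content.width_mm),
           extra_margin_mm(sheet.height_mm, content.height_mm));
}
```

Calling common_page() on A4 (210 x 297) and US-letter (216 x 279) gives the compromise page; print_fit() then shows that only the horizontal margin grows on letter and only the vertical margin grows on A4.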
Re: Please do not use en_US.UTF-8 outside the US
Markus Kuhn [EMAIL PROTECTED] writes:
> As we are talking about en_US.UTF-8:
>
> General warning: Please do not use the locale name en_US.UTF-8
> anywhere outside North America.

Why can't you use it for LC_CTYPE and LC_MESSAGES, say?

Determining paper size by locale is rather strange. What's next?
Keyboard layout? Mouse orientation? Monitor size?
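On the question of mixing categories: POSIX resolves each category separately, taking LC_ALL first, then the category's own variable, then LANG. A rough sketch of that lookup follows; the function names are invented for illustration, and treating LC_PAPER like a standard category is an assumption that holds for glibc but is not guaranteed elsewhere.

```c
#include <assert.h>
#include <stdlib.h>

/* Pure helper: returns the first non-empty value among LC_ALL, the
   category-specific variable, and LANG, in that order. Falls back to
   "C" when none is set (the real default is implementation-defined). */
const char *pick_locale(const char *lc_all, const char *lc_category,
                        const char *lang)
{
    if (lc_all && *lc_all)
        return lc_all;
    if (lc_category && *lc_category)
        return lc_category;
    if (lang && *lang)
        return lang;
    return "C";
}

/* Environment lookup for a category name such as "LC_CTYPE" or
   "LC_PAPER". */
const char *effective_locale(const char *category_name)
{
    assert(category_name != NULL);
    return pick_locale(getenv("LC_ALL"), getenv(category_name),
                       getenv("LANG"));
}
```

With LANG=en_US.UTF-8 and LC_PAPER=en_GB.UTF-8 in the environment and LC_ALL unset, effective_locale("LC_CTYPE") yields en_US.UTF-8 while effective_locale("LC_PAPER") yields en_GB.UTF-8, so a US English character type and message setup can coexist with a non-US paper default.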
Re: Please do not use en_US.UTF-8 outside the US
On Thu, 2 May 2002, Keld Jørn Simonsen wrote:
> The nice thing about LC_PAPER is that it is set either on
> installation, or as part of the normal setup. I think most people
> know how to set the locale, while some, maybe many, would not know
> that there is an /etc/papersize file.

Yes, I've been bitten more than once by these 'hidden' files lurking
around in /etc that affect the way programs work.

> LC_PAPER was in 14652 at some time but was taken out, because some
> people thought that it was not useful :-(

So, my memory was not telling me a lie. I was almost sure I had seen it
in ISO 14652 when I wrote that LC_PAPER is in ISO 14652. Later, when I
checked, it wasn't there, which led me to believe that my memory had
failed me once more. Anyway, what's the plan of ISO/IEC JTC1/SC22/WG20
on this?

Jungshik Shin
Re: [I18n]Please do not use en_US.UTF-8 outside the US
On Tue, 30 Apr 2002, Markus Kuhn wrote:
> As we are talking about en_US.UTF-8:
>
> General warning: Please do not use the locale name en_US.UTF-8
> anywhere outside North America.

> practice, but it requires that if you explain to an international
> audience how to activate UTF-8 locales, you should better use a
> non-US/CA locale. (en_GB.UTF-8 for instance seems like an excellent
> choice ... :)

% find xc -name '*UTF-8*' -print
xc/nls/Compose/en_US.UTF-8.ct
xc/nls/Compose/en_US.UTF-8
xc/nls/XLC_LOCALE/en_US.UTF-8
xc/nls/XLC_LOCALE/en_US.UTF-8.lt
xc/nls/XI18N_OBJS/en_US.UTF-8
xc/exports/lib/locale/en_US.UTF-8

Given that en_US.UTF-8 is the only instance of a locale file with UTF-8
in its name, how do I find the names of other locales which use UTF-8?

--
Dr. Andrew C. Aitchison    Computer Officer, DPMMS, Cambridge
[EMAIL PROTECTED]          http://www.dpmms.cam.ac.uk/~werdna
Re: [I18n]Please do not use en_US.UTF-8 outside the US
How are you, Markus,

I just would like to point out that we never suggested that
en_US.UTF-8 is the only locale that you will ever need. On the
contrary, we've been pointing out that each region/country should use
its own Unicode locales.

Yes, it is absolutely right that globalization isn't just about the
encoding or coded character set; Unicode alone cannot resolve every
issue, even though a widely accepted universal character set like
Unicode is absolutely a good thing to have.

With regards,

Ienup

] Date: Tue, 30 Apr 2002 21:32:39 +0100
] From: Markus Kuhn [EMAIL PROTECTED]
] Subject: [I18n]Please do not use en_US.UTF-8 outside the US
] To: [EMAIL PROTECTED]
] Cc: [EMAIL PROTECTED]
] MIME-version: 1.0
]
] As we are talking about en_US.UTF-8:
]
] General warning: Please do not use the locale name en_US.UTF-8 anywhere
] outside North America. Some older Solaris documentation suggested that
] this is the only UTF-8 locale you'll ever need, as locales don't change
] much sensible beyond the encoding anyway. This is not the case any more
] today!
]
] An increasing number of programs of US origin finally start to abandon
] the annoying old habit of assuming Legal paper and non-metric units as
] default conventions everywhere, requiring 95% of the world population to
] figure out how to reconfigure to the standard conventions.
]
] More recent software releases instead determine the default setting for
] conventions such as paper format and units of measurement with code
] similar to the following (feel free to copy it into your software as
] well):
]
]
] #include <stdio.h>
] #include <stdlib.h>
] #include <string.h>
]
] /* LC_PAPER and LC_MEASUREMENT were introduced in ISO/IEC TR 14652 */
]
] int main()
] {
]   char *units = "mm";
]   char *paper = "A4";
]   char *s;
]
]   /* First non-empty of LC_ALL, LC_PAPER, LANG decides the paper size */
]   if (((s = getenv("LC_ALL"))   && *s) ||
]       ((s = getenv("LC_PAPER")) && *s) ||
]       ((s = getenv("LANG"))     && *s))
]     if (strstr(s, "_US") || strstr(s, "_CA"))
]       paper = "Letter";
]   /* Same lookup order for the measurement units */
]   if (((s = getenv("LC_ALL"))         && *s) ||
]       ((s = getenv("LC_MEASUREMENT")) && *s) ||
]       ((s = getenv("LANG"))           && *s))
]     if (strstr(s, "_US"))
]       units = "inches";
]
]   printf("Paper: %s\nUnits: %s\n", paper, units);
]
]   return 0;
] }
]
]
] This leads to portable and agreeable default settings, using the
] standard values UNLESS you are in a locale that explicitly says that
] you are in North America. I think that's a very good implementation
] practice, but it requires that if you explain to an international
] audience how to activate UTF-8 locales, you should better use a non-US/
] CA locale. (en_GB.UTF-8 for instance seems like an excellent choice ... :)
]
] Markus
]
] --
] Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
] Email: mkuhn at acm.org, WWW: http://www.cl.cam.ac.uk/~mgk25/
]
] ___
] I18n mailing list
] [EMAIL PROTECTED]
] http://XFree86.Org/mailman/listinfo/i18n
Re: Please do not use en_US.UTF-8 outside the US
On Tue, Apr 30, 2002 at 09:32:39PM +0100, Markus Kuhn wrote:
> This leads to portable and agreeable default settings, using the
> standard values

Paper size depends on the physical size of paper in the printer, not
anything having to do with the locale. In Debian, /etc/papersize holds
the default papersize. It seems entirely likely that an American abroad
might use en_US.UTF-8, whereas a Hindu in America might want to use
hi_IN.UTF-8, no matter what the printer sitting beside them holds.

> UNLESS you are in a locale that explicitly says that
> you are in North America. I think that's a very good implementation
> practice, but it requires that if you explain to an international
> audience how to activate UTF-8 locales, you should better use a
> non-US/CA locale. (en_GB.UTF-8 for instance seems like an excellent
> choice ... :)

Why? You should use a locale with appropriate settings, and any
explanation should include that point. It's certainly no better to
encourage everyone to use en_GB.UTF-8 than to use en_US.UTF-8.

--
David Starner - [EMAIL PROTECTED]
It's not a habit; it's cool; I feel alive.
If you don't have it you're on the other side.
- K's Choice (probably referring to the Internet)
Re: [I18n]Please do not use en_US.UTF-8 outside the US
On Tue, 30 Apr 2002, Dr Andrew C Aitchison wrote:
> On Tue, 30 Apr 2002, Markus Kuhn wrote:
> > As we are talking about en_US.UTF-8:
> >
> > General warning: Please do not use the locale name en_US.UTF-8
> > anywhere outside North America.
>
> > practice, but it requires that if you explain to an international
> > audience how to activate UTF-8 locales, you should better use a
> > non-US/CA locale. (en_GB.UTF-8 for instance seems like an excellent
> > choice ... :)
>
> % find xc -name '*UTF-8*' -print
> xc/nls/Compose/en_US.UTF-8.ct
>
> Given that en_US.UTF-8 is the only instance of a locale file with
> UTF-8 in its name, how do I find the names of other locales which use
> UTF-8?

Have you looked into the Glibc locale directory? Mandrake has a bunch
of UTF-8 locales there, I believe. Glibc 2.2.x has been supporting
ll_CC.UTF-8 locales for a while. If your system doesn't have one, you
can just generate whatever ll_CC.UTF-8 locales you may need with
localedef.

As for XLC_LOCALE, you can always make one as I wrote in my message
yesterday. RedHat and Mandrake Linux may not have XLC_LOCALE files for
locales other than en_US.UTF-8, but some other Linux distributions
(e.g. TurboLinux) have zh_CN.UTF-8 and zh_TW.UTF-8.

BTW, the first UTF-8 locale other than en_US.UTF-8 shipped with Solaris
- Solaris 7? - (and AIX 4.x as well) was ko_KR.UTF-8, IIRC.

<a bit off-topic>
Now I'm almost done with switching to ko_KR.UTF-8 on my Linux box. It
works more or less fine in that I can do *more than* what I could do
under ko_KR.EUC-KR. Still missing is Middle Korean support, but it
seems that xterm-16x can be used to *display* Middle Korean text
encoded with a sequence of U+1100 Hangul Conjoining Jamos
(http://chem.skku.ac.kr/~wkpark/screenshot/2002_04_30_221718_shot.png).
Vim 6.1 already supports up to two combining characters, and Middle
Korean only needs 'two combining characters' *most of the time*. (Even
modern Korean needs more than two 'combining characters' in some
cases, though: http://jshin.net/i18n/uyeo.html)
Hopefully, with a little more tweaking in Vim 6.1 and some major
enhancements in Korean XIM (e.g. Ami), I'll be able to typeset Middle
Korean with LaTeX sooner or later. (The LaTeX side is almost ready,
too.)
</a bit off-topic>

Jungshik Shin
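Regarding the earlier question of how to find out which ll_CC.UTF-8 locales a system actually has: besides looking in the locale directory, a program can simply ask the C library. The following is a minimal sketch using only standard setlocale(); the function name locale_available is made up here, and how strictly locale names are validated depends on the C library, so a positive answer is only as reliable as the library's own check.

```c
#include <assert.h>
#include <locale.h>
#include <string.h>

/* Returns 1 if the C library accepts 'name' for LC_CTYPE, 0 otherwise.
   The previous LC_CTYPE setting is restored before returning. */
int locale_available(const char *name)
{
    char saved[256];
    const char *current;
    int ok;

    assert(name != NULL);

    /* Query (without changing) the current setting and keep a copy,
       because the returned string may be overwritten by later calls. */
    current = setlocale(LC_CTYPE, NULL);
    if (current == NULL || strlen(current) >= sizeof saved)
        return 0;
    strcpy(saved, current);

    ok = setlocale(LC_CTYPE, name) != NULL;

    /* Put things back the way they were. */
    setlocale(LC_CTYPE, saved);
    return ok;
}
```

For example, locale_available("ko_KR.UTF-8") would tell you whether that locale still has to be generated with localedef before it can be used.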
Re: Please do not use en_US.UTF-8 outside the US
On Tue, Apr 30, 2002 at 09:30:46PM -0400, Jungshik Shin wrote:
> Debian should support the LC_PAPER locale category instead of relying
> on /etc/papersize. psutils (psresize, psnup, etc.) relies on it to
> pick the default paper size. If it's set to en_US.xxx, it uses letter
> *by default*. Otherwise, it uses A4 *by default*.

And Debian psutils uses /etc/papersize (actually libpaperg, a wrapper
around that one-line file), since that's the Debian way. The locale
method doesn't make sense, as what size paper the printer has is not
usually a user setting. Also, the locale system has two alternatives,
letter or A4, ignoring the possibility that someone might have, say,
legal size paper loaded.

--
David Starner - [EMAIL PROTECTED]
It's not a habit; it's cool; I feel alive.
If you don't have it you're on the other side.
- K's Choice (probably referring to the Internet)
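To make the "one-line file" approach above concrete, here is a rough sketch of reading a paper name from such a file. It does not use libpaper's actual API; read_papersize is a hypothetical helper, and the handling of blank lines and '#' comments is an assumption about the file format rather than something stated in this thread.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Copies the first non-blank, non-comment line of 'path' (with
   surrounding whitespace trimmed) into 'out'. Returns 1 on success,
   0 if the file cannot be read, holds no paper name, or the name
   does not fit into 'out'. */
int read_papersize(const char *path, char *out, size_t out_len)
{
    char line[128];
    FILE *fp;
    int found = 0;

    assert(path != NULL && out != NULL && out_len > 0);

    fp = fopen(path, "r");
    if (fp == NULL)
        return 0;

    while (!found && fgets(line, sizeof line, fp) != NULL) {
        char *start = line;
        char *end;

        /* Skip leading whitespace. */
        while (*start != '\0' && isspace((unsigned char)*start))
            start++;
        if (*start == '\0' || *start == '#')
            continue;               /* blank line or comment (assumed syntax) */

        /* Trim trailing whitespace, including the newline. */
        end = start + strlen(start);
        while (end > start && isspace((unsigned char)end[-1]))
            end--;
        *end = '\0';

        if ((size_t)(end - start) >= out_len)
            break;                  /* name does not fit: give up */
        memcpy(out, start, (size_t)(end - start) + 1);
        found = 1;
    }

    fclose(fp);
    return found;
}
```

A program could call read_papersize("/etc/papersize", buf, sizeof buf) first and fall back to a locale-based default only when it returns 0, which is one way to give both mechanisms a defined role.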
Re: Please do not use en_US.UTF-8 outside the US
On Tue, Apr 30, 2002 at 11:09:55PM -0400, Jungshik Shin wrote:
> However, to me overriding the default at the command line is a
> perfectly good solution.

Every time you use a program? Stuff like that gets real tiring, real
fast to me.

--
David Starner - [EMAIL PROTECTED]
It's not a habit; it's cool; I feel alive.
If you don't have it you're on the other side.
- K's Choice (probably referring to the Internet)