[gentoo-dev] Re: LANG=en_GB.UTF-8 by default
On 20/02/2012 07:47, Fabian Groffen wrote:
> On 20-02-2012 03:07:33 +, Kerin Millar wrote:
>> I know that adding LANG=POSIX doesn't do anything in this case but I
>> have a feeling that its presence would be instructive to new users.
>> If a user is asked to configure something which isn't present, it
>> often generates questions which might otherwise be avoided. I've
>> changed "en_US.UTF-8" to "en_US.utf8" there for similar reasons.
>
> I don't understand. UTF-8 is the codeset, that utf8 is recognised as
> the same thing is IMO a GNUism. glibc understands "UTF-8" perfectly
> fine these days, so it should preferably be used instead. (Even the
> man-page, utf8(7), suggests that.)

Most users don't read man pages. The rationale was that the user can
copy-paste exactly what they see from "locale -a", which might diminish
the number of questions asked about it via mainstream support channels,
as well as simplifying the instructions in the sample comment. It was
just a thought; no big deal.

--Kerin
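The "UTF-8" versus "utf8" point can be illustrated with a small sketch. It assumes that glibc compares codeset names after dropping everything that is not alphanumeric and lowercasing the rest; that is my understanding of its behaviour, not something stated in this thread. The helper names normalize_codeset and same_codeset are hypothetical.

```shell
#!/bin/sh
# normalize_codeset NAME
# Approximation (an assumption, see above) of how glibc normalises a
# codeset name: keep only alphanumeric characters and lowercase them.
normalize_codeset() {
    printf '%s' "$1" | LC_ALL=C tr -cd '[:alnum:]' | LC_ALL=C tr '[:upper:]' '[:lower:]'
}

# same_codeset A B
# Succeeds when both spellings normalise to the same string.
same_codeset() {
    [ "$(normalize_codeset "$1")" = "$(normalize_codeset "$2")" ]
}

if same_codeset "UTF-8" "utf8"; then
    echo "UTF-8 and utf8 name the same codeset under this rule"
fi
```

Under that assumption either spelling selects the same locale on glibc, so which one to document is a question of portability and of what "locale -a" prints.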
[gentoo-dev] Re: LANG=en_GB.UTF-8 by default
On 20-02-2012 03:07:33 +, Kerin Millar wrote:
> I know that adding LANG=POSIX doesn't do anything in this case but I
> have a feeling that its presence would be instructive to new users. If a
> user is asked to configure something which isn't present, it often
> generates questions which might otherwise be avoided. I've changed
> "en_US.UTF-8" to "en_US.utf8" there for similar reasons.

I don't understand. UTF-8 is the codeset, that utf8 is recognised as
the same thing is IMO a GNUism. glibc understands "UTF-8" perfectly
fine these days, so it should preferably be used instead. (Even the
man-page, utf8(7), suggests that.)

--
Fabian Groffen
Gentoo on a different level
[gentoo-dev] Re: LANG=en_GB.UTF-8 by default
On 20/02/2012 00:11, William Hubbs wrote:
> On Sun, Feb 19, 2012 at 11:56:40PM +0800, Ben wrote:
>> On 19 February 2012 23:14, Ulrich Mueller wrote:
>>>> On Sun, 19 Feb 2012, Ben wrote:
>>>> In my opinion we should set a default environment with the
>>>> following values:
>>>>
>>>> LANG=en_US.UTF-8
>>>> LC_ALL=
>>>> LC_COLLATE=C
>>>>
>>>> This offers the best default options to the majority of users, and
>>>> is easy to customize for those who wish to use another locale.
>>>
>>> At least, LC_NUMERIC=C should be added to this, otherwise numbers
>>> will be formatted with commas as thousands separators.
>>>
>>> Also en_US.UTF-8 for LC_MEASUREMENT and LC_PAPER means imperial
>>> units and letter paper, which isn't optimal for users outside of
>>> the U.S.
>>>
>>> Ulrich
>>
>> I think those users (and that includes myself) should then set LANG
>> to something more appropriate to their use case.
>
> According to our localization guide, there is a safe default that
> forces UTF-8 characters but doesn't force any language. I have the
> following single line in /etc/env.d/02locale:
>
> LC_CTYPE=en_US.UTF-8

That looks good but perhaps it should also define LANG=POSIX, which is
similar to Ulrich's proposal. Something like:

# To configure for your region, set LANG to an appropriate locale, then comment
# or remove LC_CTYPE. Run "locale -a" to obtain a list of available locales.
LANG=POSIX
LC_CTYPE=en_US.utf8

I know that adding LANG=POSIX doesn't do anything in this case but I
have a feeling that its presence would be instructive to new users. If
a user is asked to configure something which isn't present, it often
generates questions which might otherwise be avoided. I've changed
"en_US.UTF-8" to "en_US.utf8" there for similar reasons.

Not to mention that, if one is curious and searches for "posix locale"
via Google, the first link is for the Open Group specification :)

I reckon that this, along with some basic information in the handbook,
would be a step in the right direction.

--Kerin
Re: [gentoo-dev] Re: LANG=en_GB.UTF-8 by default
On Sun, Feb 19, 2012 at 11:56:40PM +0800, Ben wrote:
> On 19 February 2012 23:14, Ulrich Mueller wrote:
> >> On Sun, 19 Feb 2012, Ben wrote:
> >
> >> In my opinion we should set a default environment with the following
> >> values:
> >
> >> LANG=en_US.UTF-8
> >> LC_ALL=
> >> LC_COLLATE=C
> >
> >> This offers the best default options to the majority of users, and
> >> is easy to customize for those who wish to use another locale.
> >
> > At least, LC_NUMERIC=C should be added to this, otherwise numbers will
> > be formatted with commas as thousands separators.
> >
> > Also en_US.UTF-8 for LC_MEASUREMENT and LC_PAPER means imperial units
> > and letter paper, which isn't optimal for users outside of the U.S.
> >
> > Ulrich
> >
>
> I think those users (and that includes myself) should then set LANG to
> something more appropriate to their use case.

According to our localization guide, there is a safe default that
forces UTF-8 characters but doesn't force any language. I have the
following single line in /etc/env.d/02locale:

LC_CTYPE=en_US.UTF-8

What do you think?

William
Re: [gentoo-dev] Re: LANG=en_GB.UTF-8 by default
> On Sun, 19 Feb 2012, Ben wrote:

>>> In my opinion we should set a default environment with the
>>> following values:
>>
>>> LANG=en_US.UTF-8
>>> LC_ALL=
>>> LC_COLLATE=C
>>
>>> This offers the best default options to the majority of users, and
>>> is easy to customize for those who wish to use another locale.
>>
>> At least, LC_NUMERIC=C should be added to this, otherwise numbers
>> will be formatted with commas as thousands separators.
>>
>> Also en_US.UTF-8 for LC_MEASUREMENT and LC_PAPER means imperial
>> units and letter paper, which isn't optimal for users outside of
>> the U.S.

> I think those users (and that includes myself) should then set LANG
> to something more appropriate to their use case.

And why should we set the default to a US locale then?

IMHO something like

LANG=C
LC_CTYPE=en_US.utf8

would be much less intrusive if you just want UTF-8, without
influencing other i18n variables.

Ulrich
[gentoo-dev] Re: LANG=en_GB.UTF-8 by default
On 19/02/2012 01:00, James Cloos wrote:
>> "KM" == Kerin Millar writes:
>
> KM> Arch also used to define LC_COLLATE="C" by default, probably to
> KM> mitigate unpredictable behaviour in some applications, but have
> KM> since dropped this additional variable so they must have deemed it
> KM> no longer necessary.
>
> Without LC_COLLATE="C" things like [a-z]* get a false-positive match
> on files like Makefile.

Indeed, character classes are a potential minefield. Incidentally, I
just tested Ubuntu and Arch with only LANG set to a UTF-8 locale:-

$ echo Makefile | sed -re 's/[a-z]//g' # collation rules ignored
M
$ echo Makefile | grep -Eo '[a-z]*' # collation rules ignored
akefile

In neither case are the collation rules being obeyed. In Gentoo,
however:-

$ echo Makefile | sed -re 's/[a-z]//g' # collation rules obeyed

$ echo Makefile | grep -Eo '[a-z]*' # collation rules ignored
akefile

Obeying the collation rules is ostensibly the correct thing to do but,
until everyone starts using named character classes (which will never
happen), it's not safe. The thing that worries me here is the
inconsistency in Gentoo. LC_COLLATE="C" is sufficient to work around
the issue but the above makes me wonder why we still need it.

--Kerin
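For reference, the two workarounds mentioned in this message (pinning collation to C, or using a named character class) can be sketched as follows. This is only an illustration of the technique, not a test of any particular distribution; the variable names range_result and class_result are made up for the example.

```shell
#!/bin/sh
# Strip the lowercase letters from "Makefile" in two ways that should not
# depend on the collation order of the caller's locale.

# 1. Force the C locale for this one command, so that [a-z] is the plain
#    ASCII range a..z regardless of LANG or LC_COLLATE.
range_result=$(echo Makefile | LC_ALL=C sed -e 's/[a-z]//g')

# 2. Use a named character class, which asks for "lowercase letters"
#    rather than "whatever collates between a and z".
class_result=$(echo Makefile | sed -e 's/[[:lower:]]//g')

echo "$range_result"
echo "$class_result"
```

Both commands are expected to print "M". The first form is the per-command equivalent of the LC_COLLATE="C" default discussed above; the second is what scripts would use if ranges were avoided altogether.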
[gentoo-dev] Re: LANG=en_GB.UTF-8 by default
On 19/02/2012 15:56, Ben wrote:
> On 19 February 2012 23:14, Ulrich Mueller wrote:
>>> On Sun, 19 Feb 2012, Ben wrote:
>>> In my opinion we should set a default environment with the
>>> following values:
>>>
>>> LANG=en_US.UTF-8
>>> LC_ALL=

LC_ALL isn't needed here because, unlike other LC_* settings, it does
not inherit from LANG and, thus, will be undefined anyway. Although the
above would not directly cause any harm, I am entirely certain that its
mere presence would encourage users to explicitly define it where they
most definitely should not.

The misinformation that LC_ALL should be defined was propagated by the
localization doc for rather a long time and it was rather challenging
to impress upon its maintainers that change was required. Let's not
repeat old mistakes.

>>> LC_COLLATE=C
>>>
>>> This offers the best default options to the majority of users, and
>>> is easy to customize for those who wish to use another locale.
>>
>> At least, LC_NUMERIC=C should be added to this, otherwise numbers
>> will be formatted with commas as thousands separators.
>>
>> Also en_US.UTF-8 for LC_MEASUREMENT and LC_PAPER means imperial
>> units and letter paper, which isn't optimal for users outside of
>> the U.S.
>>
>> Ulrich
>
> I think those users (and that includes myself) should then set LANG
> to something more appropriate to their use case.

I agree; the defaults should not be over-engineered. For proper
localisation, set LANG appropriately and done. The real issue is that
locale configuration isn't mentioned in the handbook. It does, however,
mention locale.gen so we're half-way there.

--Kerin
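The precedence at issue here can be written down as a short sketch. It follows the usual POSIX order (a non-empty LC_ALL wins, then the category's own variable, then LANG); the function name resolve_locale is hypothetical, "de_DE.UTF-8" is just an illustrative value, and the final fallback to "POSIX" is an assumption, since the default is implementation-defined.

```shell
#!/bin/sh
# resolve_locale LC_ALL_VALUE CATEGORY_VALUE LANG_VALUE
# Print the locale that applies to one category, given the values of
# LC_ALL, the category's own variable (e.g. LC_COLLATE) and LANG.
# Empty strings are treated the same as unset variables.
resolve_locale() {
    if [ -n "$1" ]; then
        printf '%s\n' "$1"      # LC_ALL overrides everything
    elif [ -n "$2" ]; then
        printf '%s\n' "$2"      # then the category's own variable
    elif [ -n "$3" ]; then
        printf '%s\n' "$3"      # then LANG
    else
        printf '%s\n' POSIX     # assumed fallback, see above
    fi
}

# The proposed defaults: LANG=en_US.UTF-8, LC_ALL empty, LC_COLLATE=C.
collate=$(resolve_locale "" "C" "en_US.UTF-8")
numeric=$(resolve_locale "" "" "en_US.UTF-8")

# The same settings once a user defines LC_ALL.
overridden=$(resolve_locale "de_DE.UTF-8" "C" "en_US.UTF-8")

echo "LC_COLLATE -> $collate"
echo "LC_NUMERIC -> $numeric"
echo "LC_COLLATE with LC_ALL set -> $overridden"
```

The last case is the reason for discouraging LC_ALL in a default file: once it is set, the carefully chosen LC_COLLATE=C no longer has any effect.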
Re: [gentoo-dev] Re: LANG=en_GB.UTF-8 by default
On 19 February 2012 23:14, Ulrich Mueller wrote:
>> On Sun, 19 Feb 2012, Ben wrote:
>
>> In my opinion we should set a default environment with the following
>> values:
>
>> LANG=en_US.UTF-8
>> LC_ALL=
>> LC_COLLATE=C
>
>> This offers the best default options to the majority of users, and
>> is easy to customize for those who wish to use another locale.
>
> At least, LC_NUMERIC=C should be added to this, otherwise numbers will
> be formatted with commas as thousands separators.
>
> Also en_US.UTF-8 for LC_MEASUREMENT and LC_PAPER means imperial units
> and letter paper, which isn't optimal for users outside of the U.S.
>
> Ulrich
>

I think those users (and that includes myself) should then set LANG to
something more appropriate to their use case.

Ben
Re: [gentoo-dev] Re: LANG=en_GB.UTF-8 by default
> On Sun, 19 Feb 2012, Ben wrote:

> In my opinion we should set a default environment with the following
> values:

> LANG=en_US.UTF-8
> LC_ALL=
> LC_COLLATE=C

> This offers the best default options to the majority of users, and
> is easy to customize for those who wish to use another locale.

At least, LC_NUMERIC=C should be added to this, otherwise numbers will
be formatted with commas as thousands separators.

Also en_US.UTF-8 for LC_MEASUREMENT and LC_PAPER means imperial units
and letter paper, which isn't optimal for users outside of the U.S.

Ulrich
Re: [gentoo-dev] Re: LANG=en_GB.UTF-8 by default
Excerpts from Ben's message of 2012-02-19 03:04:19 +0100:
> On 19 February 2012 09:00, James Cloos wrote:
> > Without LC_COLLATE="C" things like [a-z]* get a false-positive
> > match on files like Makefile. [...]
> >
> > The real fix is to have root be C.UTF-8. Which differs from C only
> > in that the charset is utf-8.
>
> In my opinion we should set a default environment with the following
> values:
>
> LANG=en_US.UTF-8
> LC_ALL=
> LC_COLLATE=C

Is it only on my setups, or is it "xy_XY.utf8" instead of
"xy_XY.UTF-8"?

--
Amadeusz Żołnowski
Re: [gentoo-dev] Re: LANG=en_GB.UTF-8 by default
On 19 February 2012 09:00, James Cloos wrote:
> Without LC_COLLATE="C" things like [a-z]* get a false-positive match
> on files like Makefile. [...]
>
> The real fix is to have root be C.UTF-8. Which differs from C only in
> that the charset is utf-8.

In my opinion we should set a default environment with the following
values:

LANG=en_US.UTF-8
LC_ALL=
LC_COLLATE=C

This offers the best default options to the majority of users, and is
easy to customize for those who wish to use another locale. And yes,
LC_ALL needs to be empty, because it would override the other LC_*
values.

This should be combined with some good Unicode fonts, such as
LatCyrGr-16 for the console, and dejavu for X.

Cheers,
Ben
Re: [gentoo-dev] Re: LANG=en_GB.UTF-8 by default
> "KM" == Kerin Millar writes:

KM> Arch also used to define LC_COLLATE="C" by default, probably to
KM> mitigate unpredictable behaviour in some applications, but have
KM> since dropped this additional variable so they must have deemed it
KM> no longer necessary.

Without LC_COLLATE="C" things like [a-z]* get a false-positive match on
files like Makefile.

I recently noticed a bug on b.g.o where the ebuild has something like
doc/[A-Z]* expecting that it will not match doc/some_lowercase_subdir.

The bug, of course, is that glibc fraudulently defaults the Latin, Greek
and Cyrillic locales to case-insensitive.

The real fix is to have root be C.UTF-8. Which differs from C only in
that the charset is utf-8.

-JimC
--
James Cloos OpenPGP: 1024D/ED7DAEA6
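The doc/[A-Z]* case can be reproduced in a scratch directory. The sketch below is only an illustration: the file names README and ChangeLog are invented, and the glob is expanded with LC_ALL=C set first, which is the workaround discussed in this thread rather than a fix for the locale definitions themselves.

```shell
#!/bin/sh
# Build a throwaway doc/ tree with two capitalised files and one
# lowercase subdirectory (names are hypothetical).
workdir=$(mktemp -d)
mkdir -p "$workdir/doc/some_lowercase_subdir"
: > "$workdir/doc/README"
: > "$workdir/doc/ChangeLog"

# Pin the locale to C before the glob is expanded, so that [A-Z] is the
# ASCII range A..Z and cannot pick up lowercase names through collation.
matches=$(cd "$workdir/doc" && LC_ALL=C && export LC_ALL && echo [A-Z]*)

echo "$matches"
rm -rf "$workdir"
```

With the locale pinned, only the two capitalised names are listed and some_lowercase_subdir is left out, which is what the ebuild author expected.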
[gentoo-dev] Re: LANG=en_GB.UTF-8 by default
On 15/02/2012 12:22, Mr. Aaron W. Swenson wrote:
> On Wed, Feb 15, 2012 at 12:58:52PM +0100, Francesco R.(vivo) wrote:
>> as subject says could gentoo change the policy and set a UTF-8
>> environment by default?

Perhaps it should define LANG="en_US.UTF-8" as a reasonable default,
which would be in line with other notable distros. Arch also used to
define LC_COLLATE="C" by default, probably to mitigate unpredictable
behaviour in some applications, but have since dropped this additional
variable so they must have deemed it no longer necessary.

I think that having a default configuration file would also raise
awareness of the importance of locale configuration and make it less
likely that users configure their systems inappropriately (defining
LC_ALL, for instance).

>> P.S. would be nice to have a wd_WD.UTF-8 with WD standing for world,
>> just a country is so 1900

Different countries/regions have different standards and conventions
for character classification, case conversion, date/numerical/currency
formatting etc. There's no basis on which to formally standardise a
world-wide definition.

> However, the stage 3, last time I used it, didn't default to a UTF-8
> environment, and it didn't default to using and/or including a
> capable UTF-8 font. It is something I think we should look at
> changing.

Yet "unicode" is a default flag in the standard profiles.

Most console fonts have poor coverage. The best one I've found thus far
is "LatCyrGr-16" from fonty-rg, which provides good Latin and Cyrillic
coverage along with some Greek and esoteric punctuation characters.
Using this font, I've yet to find any developer's name that doesn't
render as expected while perusing the contents of the portage tree.

Being a 512 character font, one loses bold support unless using a
framebuffer console. Given that the default console fonts aren't
especially useful, it seems a small price to pay.

--Kerin