On Sun, Apr 17, 2011 at 03:52:14PM -0700, Philip Guenther wrote:
> On Sun, Apr 17, 2011 at 1:56 AM, Stefan Sperling wrote:
> > On Tue, Apr 05, 2011 at 12:25:37AM +0200, Stefan Sperling wrote:
> >> For isprint() to work correctly in a UTF-8 locale applications must
> >> set up the LC_CTYPE locale before using isprint().
> >>
> >> As done for hexdump and tcpdump already.
> >>
> >> This diff covers all offenders in usr.bin.
> >
> > Todd Miller suggested using LC_ALL instead of LC_CTYPE so we don't have
> > to revisit these when we add support for more categories.
>
> Has there been any sort of audit of how this will affect scripts
> distributed with the system?

No.

> I seem to recall having to fix various
> scripts elsewhere when Redhat started shipping with LANG=en_US.UTF-8
> set in the default shell environment, things like character ranges in
> grep patterns changing behavior. The scripts didn't handle any sort
> of non-C locale correctly; the change in default just made that
> incredibly obvious. The quick fix is to apply the hammer and pass
> LC_ALL=C, but that's overkill when all you really want is to force
> LC_CTYPE=C behavior.

Is this really a problem we need to be concerned about?
We're far from making a UTF-8 locale the default locale, if ever.

Our default locale is C, with the addition of latin1 semantics for some
functions like isprint(). It has been ever since the initial citrus bits
were imported in 2005. So it's already a little bit more than ASCII.

> Hmm...
>
> export LC_CTYPE=C
> [[ -n $LC_ALL ]] && export LC_COLLATE=$LC_ALL LC_TIME=$LC_ALL \
> LC_NUMERIC=$LC_ALL LC_MONETARY=$LC_ALL LC_MESSAGES=$LC_ALL
>
> A quick grep of /etc/*/* on a Redhat box finds *lots* of LANG=C
> settings on commands. Wheee....

Setting LANG=C for system scripts makes sense to me if the default locale
is something else. I suppose their problems are exacerbated by having
implemented LC_NUMERIC and LC_TIME, which can change output of programs
like bc(1) and date(1). We don't implement those.

Regarding LC_CTYPE, properties of characters in the range scripts care
about remain the same in C, ISO8859-1, and UTF-8 locales.
In fact, the UTF-8 locale is now stricter than our default locale because
it does not even consider single-byte characters above 0x7f valid, just
like C probably should.
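
For reference, the requirement quoted at the top of this thread (set up
the LC_CTYPE locale before using isprint()) could be sketched in C roughly
as below. This is only an illustration, not the actual usr.bin diff: the
helper names init_ctype() and count_printable() are made up for this
example, and it uses LC_CTYPE where the diff under discussion may well use
LC_ALL instead. Passing "" as the locale name asks setlocale() to take the
locale from the environment.

```c
#include <assert.h>
#include <ctype.h>
#include <locale.h>
#include <stddef.h>

/*
 * Hypothetical helper (not from the diff): select the locale used for
 * character classification. A name of "" means "take it from the
 * environment". Returns 0 on success, -1 if the locale is unavailable.
 */
static int
init_ctype(const char *name)
{
	return setlocale(LC_CTYPE, name) != NULL ? 0 : -1;
}

/*
 * Hypothetical helper: count the bytes in buf that isprint() accepts
 * under the LC_CTYPE locale currently in effect.
 */
static size_t
count_printable(const char *buf, size_t len)
{
	size_t i, n = 0;

	assert(buf != NULL || len == 0);
	for (i = 0; i < len; i++) {
		/* The cast keeps bytes above 0x7f from becoming negative. */
		if (isprint((unsigned char)buf[i]))
			n++;
	}
	return n;
}
```

A program would call init_ctype("") once near the top of main(), before
any isprint() call, and fall back to the default locale if it returns -1.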
