Travis Vitek wrote:


Martin Sebor wrote:
My only requirement is to get those tests to pass in a reasonable
amount of time (i.e., without timing out), and without compromising
their effectiveness.

 > Do we want to give up on the locale name matching, or do we want to
 > include zh_CN in the list of locales to test? What about matching the
 > encoding? Should we ignore all of this and just find one locale for
 > each value of MB_CUR_MAX from 1 to MB_LEN_MAX and run the test on them?

Maybe. I'll let you propose what makes the most sense to you :)

Martin

Well, the AIX I'm testing on has 683 installed locale files. Of those,
many are links to locales with different names. For example, we have

    $ locale -a | grep "_CN" | grep -v "\."
    ZH_CN
    Zh_CN
    zh_CN
    $ ls -l /usr/lib/nls/loc/ZH_CN
    lrwxrwxrwx 1 bin bin bin 28 Feb 8 2008 /usr/lib/nls/loc/ZH_CN -> /usr/lib/nls/loc/ZH_CN.UTF-8
    $ ls -l /usr/lib/nls/loc/Zh_CN
    lrwxrwxrwx 1 bin bin bin 28 Feb 8 2008 /usr/lib/nls/loc/Zh_CN -> /usr/lib/nls/loc/Zh_CN.GB18030
    $ ls -l /usr/lib/nls/loc/zh_CN
    lrwxrwxrwx 1 bin bin bin 28 Feb 8 2008 /usr/lib/nls/loc/zh_CN -> /usr/lib/nls/loc/zh_CN.IBM-eucCN

The locales that are mapped to [ZH_CN.UTF-8, Zh_CN.GB18030,
zh_CN.IBM-eucCN] also appear in the locale list, so we have many
duplicated locales. So, for an immediate reduction in the number of tested
locales, we could eliminate these duplicates. How to tell if a locale is a
duplicate? I'm not sure.
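
One possible way to spot duplicates, given that the aliases above are
symbolic links, is to resolve each locale file to its final path (for
example with POSIX realpath()) and keep only one name per resolved file.
The following is a minimal sketch of just the de-duplication step; the
locale_file struct and the unique_locales() function are hypothetical
names, and it assumes the resolved paths have already been collected.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical record: a locale name and the file it resolves to once
// all symbolic links have been followed.
struct locale_file {
    std::string name;      // name as listed by `locale -a`
    std::string resolved;  // fully resolved path of the locale file
};

// Returns the first name seen for each distinct resolved file, in the
// order the entries were given. Later aliases of the same file are dropped.
std::vector<std::string>
unique_locales(const std::vector<locale_file>& files)
{
    std::set<std::string> seen;
    std::vector<std::string> result;

    for (std::size_t i = 0; i != files.size(); ++i) {
        // Assumption: the caller has filled in every resolved path.
        assert(!files[i].resolved.empty());

        // insert() reports whether the path was new to the set.
        if (seen.insert(files[i].resolved).second)
            result.push_back(files[i].name);
    }
    return result;
}
```

This only catches duplicates that are links to the same file; two
separate files with identical contents would still both be tested.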

Another option would be to ignore all locales that don't match the regular
expression "[a-z][a-z]_[A-Z][A-Z]([EMAIL PROTECTED])?$" or the fnmatch
expressions "[a-z][a-z]_[A-Z][A-Z]" and "[EMAIL PROTECTED]". The C/POSIX
locales don't match this, but we can explicitly allow them.
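
As a rough illustration of such a name filter, here is a sketch using
std::regex. It covers only the language_COUNTRY part of the pattern and
leaves out the optional suffix, so names that carry a codeset or modifier
are rejected here. The function name is_plain_locale_name() is made up,
and matching the whole name (rather than searching within it) is an
assumption.

```cpp
#include <regex>
#include <string>

// Returns true if `name` should be kept for testing. Only names of the
// form ll_CC (two lowercase letters, underscore, two uppercase letters)
// are accepted, plus the C and POSIX locales.
bool is_plain_locale_name(const std::string& name)
{
    // The C/POSIX locales don't match the pattern; allow them explicitly.
    if (name == "C" || name == "POSIX")
        return true;

    // regex_match() succeeds only if the entire name matches the pattern.
    static const std::regex pattern("[a-z][a-z]_[A-Z][A-Z]");
    return std::regex_match(name, pattern);
}
```

With the AIX names listed earlier, this keeps zh_CN and drops ZH_CN and
Zh_CN.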

This alone cuts the number of locales down significantly, though it does
affect other platforms. Here is a small table showing the total number of
locales, and the number of locales that match the above regular
expression.

        Okay Total
AIX      226   603
Compaq    33    40
HP-UX    142   160
Irix      39    60
Linux    479   582
Solaris  223   331

Another option is to build up a list of all installed locales [their names
and other properties], and then provide a mechanism to search through, or
iterate over that list. If you want to run a test on all locales that have
a name matching some expression, you write a function or function object
to return true on match. You pass that to the rw_locales_match() routine,
and it gives you the first match. Call again to get the next match or
null.

    for (const rw_locale_entry* e = rw_locales_match(0, fun);
         e; e = rw_locales_match(e, fun))
    {
        // run the test on the locale described by *e
    }

If you want to select only locales with mb_cur_max of 4, you either write
a filter, or you explicitly iterate over the list. If we really decide
that it is necessary to write up a SQL type language for selecting
locales, then that system can be implemented on top of this.
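
To make the idea concrete, here is a minimal sketch of how the iteration
could work. Everything in it is an assumption for illustration: the
layout of rw_locale_entry, the fixed table standing in for the list of
installed locales (with made-up MB_CUR_MAX values), and the use of a
plain function pointer for the predicate instead of a function object.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical entry layout; the real one would hold more properties.
struct rw_locale_entry {
    const char* name;        // locale name
    int         mb_cur_max;  // value of MB_CUR_MAX in that locale
};

// Stand-in for the list of installed locales. The values are
// illustrative placeholders, not measurements from any system.
static const rw_locale_entry rw_locale_table[] = {
    { "C",                1 },
    { "en_US.ISO-8859-1", 1 },
    { "ja_JP.eucJP",      3 },
    { "zh_CN.GB18030",    4 },
    { "en_US.UTF-8",      4 }
};

static const std::size_t rw_locale_count =
    sizeof rw_locale_table / sizeof *rw_locale_table;

typedef bool (*rw_locale_pred)(const rw_locale_entry&);

// Returns the first matching entry after `prev`, or the first matching
// entry overall when `prev` is null. Returns null when none is left.
const rw_locale_entry*
rw_locales_match(const rw_locale_entry* prev, rw_locale_pred pred)
{
    assert(pred != 0);

    const rw_locale_entry* const end = rw_locale_table + rw_locale_count;
    const rw_locale_entry* e = prev ? prev + 1 : rw_locale_table;

    for (; e != end; ++e)
        if (pred(*e))
            return e;

    return 0;
}

// Example filter: selects only locales with an mb_cur_max of 4.
bool has_mb_cur_max_4(const rw_locale_entry& e)
{
    return e.mb_cur_max == 4;
}
```

Passing has_mb_cur_max_4 as `fun` in the loop above would visit only the
entries whose mb_cur_max is 4.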

Travis


Ah, my primitive scheme above isn't quite good enough. The time to run the
22.locale.ctype.is test was 28m35s, and I've reduced it to 6m28s with
an 11s build on AIX. That is still over the 5 minute limit, so the test
would time out, because we're still testing far too many locales...


Now that I've seen that, it makes me wonder about the other proposal and the
SQL-like query string idea. If we get a locale from the system, we don't
have access to the original data that was in the ASCII source file. We just
get the data presented from the C/C++ locale. This means that we have to
discover information about the locale [like the mb_cur_max value]. This may
take considerable time.
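
For reference, discovering a single property such as the mb_cur_max value
means switching to the locale and asking the C library. A minimal sketch,
assuming a helper named discover_mb_cur_max() (a made-up name) and that
only LC_CTYPE needs to be changed:

```cpp
#include <cassert>
#include <clocale>
#include <cstdlib>
#include <string>

// Returns MB_CUR_MAX for the named locale, or 0 if the locale cannot be
// set. The previous LC_CTYPE locale is restored before returning.
int discover_mb_cur_max(const char* name)
{
    assert(name != 0);

    // The string returned by setlocale() may be overwritten by a later
    // call, so copy it before changing the locale.
    const char* const current = std::setlocale(LC_CTYPE, 0);
    const std::string saved(current ? current : "C");

    if (!std::setlocale(LC_CTYPE, name))
        return 0;

    // MB_CUR_MAX reflects the LC_CTYPE category of the current locale.
    const int result = int(MB_CUR_MAX);

    std::setlocale(LC_CTYPE, saved.c_str());
    return result;
}
```

Doing this once per installed locale, for each property of interest, is
where the time would go.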

Maybe we should start by putting together a comprehensive list
of locales installed on all our systems and their properties:

  std-country, std-lang, std-codeset, native-name, aliases, MB_CUR_MAX,
  ...(anything else of interest)...

std-country (ISO-3166):
  http://www.iso.org/iso/english_country_names_and_code_elements

std-lang (ISO-639):
  http://www.loc.gov/standards/iso639-2/php/English_list.php

std-codeset:
  http://www.iana.org/assignments/character-sets

We could then create a database mapping each of the set of standard
names to the platform-specific names on every supported OS. We'd
also need the ability to ask for one out of a set of options (i.e.,
give me a locale that corresponds to en_US.UTF-8 if it exists, or
else en_US.ISO-8859-1, or an en_US locale in any encoding, or if
even that's not available, anything you've got in English ;-)
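
The "one out of a set of options" lookup could be sketched roughly as
follows. This is only an illustration: pick_locale() is a hypothetical
name, it works on the standard names directly and skips the mapping to
platform-specific names, and it treats each option as a name prefix so
that "en_US" accepts any encoding and "en" accepts any English locale.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Returns the first installed locale whose name starts with one of the
// preferences, trying the preferences in order. Returns an empty string
// if no installed locale matches any of them.
std::string pick_locale(const std::vector<std::string>& preferences,
                        const std::vector<std::string>& installed)
{
    for (std::size_t i = 0; i != preferences.size(); ++i) {
        const std::string& wanted = preferences[i];

        for (std::size_t j = 0; j != installed.size(); ++j) {
            // compare() checks whether `wanted` is a prefix of the name.
            if (installed[j].compare(0, wanted.size(), wanted) == 0)
                return installed[j];
        }
    }
    return std::string();
}
```

For the example above, the preferences would be "en_US.UTF-8",
"en_US.ISO-8859-1", "en_US", and "en", in that order.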

Martin
