svn commit: r448754 [1/6] - /incubator/stdcxx/trunk/util/

sebor Thu, 21 Sep 2006 17:42:57 -0700

Author: sebor
Date: Thu Sep 21 17:42:16 2006
New Revision: 448754

URL: http://svn.apache.org/viewvc?view=rev&rev=448754
Log:
2006-09-21  Martin Sebor  <[EMAIL PROTECTED]>


        * aliases.cpp (get_installed_locales): Redirected locale's stderr
        to /dev/null to prevent it from cluttering our own. Eliminated
        the handling of invalid category values.
        * charmap.h (<list>): Included.
        (portable_charset): Declared.
        (get_n_cmap, get_rn_cmap, get_w_cmap, get_rw_cmap): Documented.
        (get_n_cmap2, get_rn_cmap2, get_strval_map, increment_val): Renamed...
        (get_mb_cmap, get_rmb_cmap, get_symnames_list, increment_wchar): ...to
        this.
        (get_strval_map): Replaced...
        (get_symnames_list): ...with this.
        (process_ellipsis): Changed to return the number of characters denoted
        by the ellipsis instead of void.
        (increment_strval): Renamed...
        (increment_encoding): ...to this for clarity and changed it to return
        true on success and false otherwise.
        (verify_portable_charset): Made const.
        (parse_ext_strval): Renamed...
        (encoding_to_mbchar): ...to this for clarity.
        (add_to_cmaps): Added a bool argument.
        (n_cmap2_, rn_cmap2_): Renamed...
        (mb_cmap_, rmb_cmap_): ...to this for clarity.
        (valid_mb_set_, valid_wchar_set_, strval_map_): Removed unused members.
        (symnames_list_): Added new member.
        * def.h (process_abs_ellipsis): New.
        (process_sym_ellipsis): Returned the number of characters denoted by
        the ellipsis instead of void.
        (hex_increment, dec_increment): Removed (defined as static non-members
        in collate.cpp).
        (process_mask): Added an argument denoting the human readable name
        of the mask for progress diagnostics.
        (process_transliteration, process_transliteration_statement,
        generate_xliteration_data): Renamed...
        (process_xlit, process_xlit_statement, gen_xlit_data): ...to this for
        brevity.
        (extract_string_array): New.
        (ctype_offset_tab_t): Renamed...
        (codecvt_offset_tab_t): ...to this.
        (wchar_off_map_, mb_char_offs_, wchar_offs_, wchar_utf8_to_ext_,
        utf8_offs_, valid_codecvt_utf8_set_, utf8_map_, next_codecvt_tab_num_,
        next_wchar_codecvt_tab_num_, next_utf8_codecvt_tab_num_, codecvt_out_):
        Removed data members.
        (mb_char_offs_iter, wchar_offs_iter, utf8_offs_iter, utf8_map_iter,
        (gen_valid_codecvt_wchar_set, gen_valid_codecvt_utf8_set, gen_utf8_map,
        generate_wchar_codecvt_table, generate_utf8_codecvt_table):
        Replaced member functions...
        (gen_mbchar_tables, gen_wchar_tables, gen_utf8_tables): ...with these.
        * diagnostic.h (E_NOTSUP, W_NOTSUP, W_CHARMAP, I_SKIP): Added.
        (W_COLVAL): Removed.
        * scanner.h (token_id): Arranged enumerators by LC_XXX section and
        documented. Added enumerators for LC_ADDRESS and other sections defined
        by ISO/IEC TR 14652.
        (tok_ellipsis, tok_dellipsis, tok_qellipsis, tok_doub_inc_ellipsis):
        Renamed...
        (tok_abs_ellipsis, tok_hex_ellipsis, tok_dec_ellipsis,
        tok_dbl_ellipsis): ...to this for better correspondence to ISO/IEC
        TR 14652 names.
        (tok_eof, tok_cont_line): Removed unused names.
        (tok_octal_value, tok_decimal_value, tok_hex_value): Replaced...
        (tok_char_value): ...with this for simplicity.
        * charmap.cpp (convert_escape): Removed dead code.
        (portable_charset): Defined.
        (convert_to_wc): Simplified, added comments.
        (increment_val): Renamed...
        (increment_wchar): ...to this, renamed locals for clarity, and replaced
        hardcoded 0xff with UCHAR_MAX (is this really correct when char is more
        than 8 bits wide?).
        (increment_strval): Renamed...
        (increment_encoding): ...to this, removed local statics, and optimized.
        The function is unused since it is not possible to increment a multibyte
        character so that its last byte overflows (i.e., wraps around to 0). It
        should/will be removed in a future commit.
        (parse_ext_strval): Renamed...
        (encoding_to_mbchar): ...to this and renamed locals.
        (convert_sym_to_ucs): Tightened up the checking of symbolic character
        names for validity (expect a hex digit after the leading "<U" prefix).
        Used strtoul instead of strtol for simplicity.
        (convert_to_ucs): Adjusted formatting.
        (add_to_cmaps): Added an argument indicating whether the encoding
        is multibyte or whether it should be converted to it. Optimized and
        added inline comments.
        (process_ellipsis): Renamed locals, added inline comments and adjusted
        formatting for clarity. Corrected sprintf format. Avoided calling
        increment_strval and instead incremented the last byte of a multibyte
        character, detecting and diagnosing overflow, for efficiency.
        (process_chars): Added a progress message at the beginning of the
        function. Counted the number of tokens, ellipses, and characters
        processed. Added inline comments. Simplified the handling of numeric
        character values (tok_char_value). Used tok_abs_ellipsis,
        tok_hex_ellipsis, and tok_dec_ellipsis.  Added a progress message
        at the end of the function indicating the number of tokens, ellipses,
        and characters processed.
        * codecvt.cpp (gen_valid_codecvt_mb_set, gen_valid_codecvt_wchar_set,
        gen_valid_codecvt_utf8_set, generate_codecvt_table, gen_utf8_map,
        generate_wchar_codecvt_table, generate_utf8_codecvt_table): Replaced...
        (gen_mbchar_tables, gen_wchar_tables, gen_utf8_tables): ...with these
        functions and simplified by replacing member variables with function
        arguments wherever possible.
        (create_wchar_utf8_table): Moved to collate.cpp where the function
        is actually used.       
        (generate_xliteration_data): Renamed...
        (gen_xlit_data): ...to this.
        (write_codecvt): Called gen_mbchar_tables, gen_wchar_tables, and
        gen_utf8_tables for better locality of reference and documented.
        * collate.cpp (lc_name): New.
        (create_wchar_utf8_table): Moved from codecvt.cpp.
        (dec_increment, hex_increment): Made static non-members.
        (process_weights, add_missing_values, process_collate, write_collate):
        Used preincrement instead of postincrement with iterators for better
        efficiency.
        * ctype.cpp (lc_name): New.
        (convert_literal_to_ucs4): Handled tok_char_value instead of
        tok_octal_value, tok_decimal_value, and tok_hex_value.
        (process_abs_ellipsis): New function.
        (process_sym_ellipsis): Simplified.
        (process_mask): Added an argument, displayed progress message, and
        simplified.
        (process_upper_lower): Displayed progress message, and simplified.
        (process_transliteration, process_transliteration_statement): Renamed...
        (process_xlit, process_xlit_statement): ...to this.
        (process_ctype): Added a progress diagnostic.
        * def.cpp (convert_to_ext): Used rmb_cmap_iter.
        (process_input): Removed progress diagnostics.
        (verify_portable_charset): Used portable_charset.
        * diagnostic.cpp (issue_diag): Used fputs instead of fprintf where
        appropriate and simplified.
        * messages.cpp (lc_name): New.
        (process_messages): Used lc_name, issued an informative diagnostic,
        and simplified ever so slightly.
        * monetary.cpp (lc_name): New.
        (process_monetary): Used lc_name. Issued an informative diagnostic.
        * numeric.cpp (lc_name): New.
        (process_numeric): Used lc_name, issued an informative diagnostic,
        and simplified ever so slightly.
        * time.cpp (lc_name): New.
        (parse_era): Handled empty era strings and constified locals.
        (extract_string_array): Defined new member function of class Def.
        (process_time): Used lc_name, issued an informative diagnostic,
        called extract_string_array.
        (write_time): Issued an informative diagnostic and simplified ever
        so slightly.
        * scanner.cpp (ScannerContext): Declared copy ctor and assignment
        operator private and removed definitions.
        (process_token): Simplified the handling of numeric character values,
        removed unnecessary qualification from the definition of tok_map array
        elements.
        (next_token): Used the new names tok_abs_ellipsis, tok_hex_ellipsis,
        tok_dec_ellipsis, and tok_dbl_ellipsis, eliminated a C-style const
        cast to avoid a gcc warning.
        * localedef.cpp (--notes): Documented option.
        * locale.cpp: Declared all definitions static, renamed many locals for
        clarity and/or brevity.
        (collate_charmap, ctype_charmap, time_charmap, num_charmap, mon_charmap,
        messages_charmap, current_locales, is_env_set): Removed globals (moved
        into get_charmap()).
        (sect_charmap, sect_lc_all, sect_lc_collate, sect_lc_ctype, sect_lc_time,
        sect_lc_messages, sect_lc_monetary, sect_lc_numeric): Defined new global
        constants.
        (UChar): New convenience typedef.
        (validate): Tweaked text of error message.
        (initialize_struct): Renamed...
        (init_struct): ...to this for brevity, removed redundant arguments
        and added their static local equivalents.
        (initialize): Renamed...
        (init_section): ...to this and removed redundant arguments.
        (initialize_structs): Renamed...
        (init_sections): ...to this. Called init_section with fewer arguments.
        (get_charmap): Added static locals for better encapsulation and locality
        of reference.
        (escape_value): New function to handle the formatting of characters
        and weights using escape sequences.
        (create_str): Simplified.
        (print_section, print_keyword): New helper functions.
        (print_weight): Simplified.
        (write_coll_info): Renamed locals.
        (print_ce_info): Added comments, adjusted code formatting.
        (print_lc_collate, print_lc_ctype, print_lc_messages, print_lc_numeric,
        print_lc_time, print_lc_all): Printed comments on stdout with additional
        detail about each section. Simplified.
        (print_int): New helper to handle the printing of numeric values.
        (print_time_keyword): New helper to handle the printing of LC_TIME
        keywords.
        (print_mask, print_toupper, print_tolower): Simplified, added inline
        comments, adjusted formatting.
        (print_frac_digits, ...): Called print_int or print_time_keyword
        as appropriate for simplicity and to avoid code duplication.
        (print_help_msg): Renamed...
        (print_help): ...to this.
        (initialize_env_vars): Removed (done in get_charmap).
        (print_locale_name): Removed.
        (print_charmap): Collected character statistics.
        (print_charmap): Added a default argument, computed additional stats.
        (print_lc_var): New function to handle the formatting of localization
        environment variables according to POSIX rules.
        (handlers): New array of command line argument handlers.
        (main): Used handlers. Added -p to force locale to use symbolic
        character names from the Portable Character Set whenever possible.
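
Note on the increment_encoding and process_ellipsis entries: the idea of
incrementing the last byte of an encoding written as escape sequences
(e.g., "/xf0/x80"), keeping its notation and reporting overflow rather
than wrapping around to 0, might be sketched as follows. This is not the
code from this commit. The name increment_last_byte and the default
escape character '/' are assumptions made for illustration, and
std::snprintf assumes a C++11 library.

```cpp
#include <cctype>    // for isxdigit()
#include <climits>   // for UCHAR_MAX
#include <cstdio>    // for snprintf()
#include <cstdlib>   // for strtoul()
#include <string>

// Hypothetical helper (not part of stdcxx): increments the last byte of
// an encoding such as "/xf0/x80", "/d109", or "/101" in place and returns
// true, or returns false (leaving the string unchanged) on malformed
// input or when the byte cannot be incremented without overflow.
bool increment_last_byte (std::string &encoding, char escape = '/')
{
    // find the last escape character in the encoding
    const std::string::size_type pos = encoding.rfind (escape);

    if (std::string::npos == pos || encoding.size () <= pos + 1)
        return false;

    // an octal escape has no prefix; decimal and hex use 'd' and 'x'
    int                    base        = 8;
    const char            *fmt         = "%02lo";
    std::string::size_type first_digit = pos + 1;

    switch (encoding [pos + 1]) {
    case 'd': base = 10; fmt = "d%02lu"; ++first_digit; break;
    case 'x': base = 16; fmt = "x%02lX"; ++first_digit; break;
    }

    const char* const digits = encoding.c_str () + first_digit;

    // reject an empty digit sequence, a sign, or leading whitespace
    if (!std::isxdigit (static_cast<unsigned char>(*digits)))
        return false;

    char *end = 0;
    const unsigned long last_byte = std::strtoul (digits, &end, base);

    // trailing garbage or a byte that is already at its maximum
    if (*end || UCHAR_MAX <= last_byte)
        return false;

    // format the incremented byte in the same notation
    char byte_str [8];
    std::snprintf (byte_str, sizeof byte_str, fmt, last_byte + 1);

    // replace the last escape sequence (after the escape character)
    encoding.replace (pos + 1, std::string::npos, byte_str);

    return true;
}
```

Letting the library format the value avoids splitting it into digits by
hand, where the decimal case is easy to get wrong (the tens digit is
(value / 10) % 10).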

Modified:
    incubator/stdcxx/trunk/util/aliases.cpp
    incubator/stdcxx/trunk/util/charmap.cpp
    incubator/stdcxx/trunk/util/charmap.h
    incubator/stdcxx/trunk/util/codecvt.cpp
    incubator/stdcxx/trunk/util/collate.cpp
    incubator/stdcxx/trunk/util/ctype.cpp
    incubator/stdcxx/trunk/util/def.cpp
    incubator/stdcxx/trunk/util/def.h
    incubator/stdcxx/trunk/util/diagnostic.cpp
    incubator/stdcxx/trunk/util/diagnostic.h
    incubator/stdcxx/trunk/util/locale.cpp
    incubator/stdcxx/trunk/util/localedef.cpp
    incubator/stdcxx/trunk/util/messages.cpp
    incubator/stdcxx/trunk/util/monetary.cpp
    incubator/stdcxx/trunk/util/numeric.cpp
    incubator/stdcxx/trunk/util/scanner.cpp
    incubator/stdcxx/trunk/util/scanner.h
    incubator/stdcxx/trunk/util/time.cpp

Modified: incubator/stdcxx/trunk/util/aliases.cpp
URL: 
http://svn.apache.org/viewvc/incubator/stdcxx/trunk/util/aliases.cpp?view=diff&rev=448754&r1=448753&r2=448754
==============================================================================
--- incubator/stdcxx/trunk/util/aliases.cpp (original)
+++ incubator/stdcxx/trunk/util/aliases.cpp Thu Sep 21 17:42:16 2006
@@ -38,7 +38,7 @@
 #include <cstdlib>
 #include <cstdio>
 #include <cstring>   // for memcpy(), strlen()
-#include <clocale>
+#include <clocale>   // for setlocale()
 #include <locale>
 #include <iostream>
 #include <string>
@@ -541,10 +541,10 @@
     }
 
     char* locname = slocname;
-    char* save_localename = 0;
 
-    if (loc_cat != int (LC_INVALID_CAT))
-        save_localename = std::setlocale (loc_cat, 0);
+    // save the current locale setting and set the locale to "C"
+    const char* const save_localename = std::setlocale (LC_ALL, 0);
+    std::setlocale (LC_ALL, "C");
 
 #if __GNUG__ == 2 && __GNUC_MINOR__ == 96
 
@@ -582,11 +582,11 @@
     //   sizeof ("locale -a | grep \"\" > ")   // 22
     // + strlen (fname)                        // must be <= L_tmpnam
 
-    char cmd [22 + L_tmpnam];
+    char cmd [80 + L_tmpnam];
 
-    std::sprintf (cmd, "/usr/bin/locale -a > %s 2>/dev/null", fname);
+    std::sprintf (cmd, "LC_ALL=C /usr/bin/locale -a >%s 2>/dev/null", fname);
 
-    int ret = std::system(cmd);
+    const int ret = std::system (cmd);
 
     if (ret && ret != 256) {
         std::strcpy (slocname, "call to system ");
@@ -653,8 +653,10 @@
         *locname = '\0';
 
     }
-    if (loc_cat != int (LC_INVALID_CAT))
-        std::setlocale (loc_cat, save_localename);
+
+    // restore the original locale
+    if (save_localename)
+        std::setlocale (LC_ALL, save_localename);
 
     std::fclose (f);
     std::remove (fname);
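
Note on saving and restoring the locale: the string returned by
std::setlocale (LC_ALL, 0) may be overwritten by a later call to
setlocale, so one option is to copy the name before switching to "C".
A minimal sketch of that approach follows. The class name ScopedCLocale
is hypothetical and not part of stdcxx.

```cpp
#include <clocale>   // for LC_ALL, setlocale()
#include <string>

// Hypothetical RAII helper (not part of stdcxx): switches the global
// C locale to "C" and restores the previous setting on scope exit.
class ScopedCLocale
{
    std::string saved_;   // copy of the previous locale name
    bool        valid_;   // true if the previous name was obtained

    // not copyable
    ScopedCLocale (const ScopedCLocale&);
    ScopedCLocale& operator= (const ScopedCLocale&);

public:

    ScopedCLocale (): valid_ (false) {
        const char* const cur = std::setlocale (LC_ALL, 0);

        if (cur) {
            // copy the name: the buffer may be overwritten
            // by the call to setlocale() below
            saved_ = cur;
            valid_ = true;
        }

        std::setlocale (LC_ALL, "C");
    }

    ~ScopedCLocale () {
        // restore the original locale
        if (valid_)
            std::setlocale (LC_ALL, saved_.c_str ());
    }
};
```

Declaring an object of this type at the top of a block would cover every
return path from it, including early returns on error.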

Modified: incubator/stdcxx/trunk/util/charmap.cpp
URL: 
http://svn.apache.org/viewvc/incubator/stdcxx/trunk/util/charmap.cpp?view=diff&rev=448754&r1=448753&r2=448754
==============================================================================
--- incubator/stdcxx/trunk/util/charmap.cpp (original)
+++ incubator/stdcxx/trunk/util/charmap.cpp Thu Sep 21 17:42:16 2006
@@ -45,7 +45,7 @@
 #include <clocale>    // for LC_CTYPE, setlocale()
 #include <cstdio>
 #include <cstdlib>
-#include <cstring>    // for strerror()
+#include <cstring>    // for strrchr(), strerror()
 
 #include <map>
 #include <string>
@@ -234,66 +234,164 @@
 }
 
 
-#if 0
-
-// convert the first character of a string to an unsigned char 
-unsigned char Charmap::
-convert_escape (const char *str, const char **pend /* = 0 */) const
+// count the number of bytes in a multibyte sequence denoted
+// by the argument by counting the number of escape characters
+std::size_t Charmap::mbcharlen (const std::string &str) const
 {
-    assert (str != 0);
-
-    if (!*str || *str != scanner_.escape_char ())
-        issue_diag (E_CVT, true, &next,
-                    "unable to convert character %s\n", str);
+    std::size_t count = 1;
 
-    long ch = 0;
+    const char escape = scanner_.escape_char ();
 
-    char *end;
+    for (std::size_t idx = 0; ; ++idx, ++count) {
+        idx = str.find (escape, idx);
 
-    switch (str [1]) {
-    case '%': ch = std::strtol (str + 2, &end, 8); break;
-    case 'd': ch = std::strtol (str + 2, &end, 10); break;
-    case 'x': ch = std::strtol (str + 2, &end, 16); break;
-    default: ch = -1; end = _RWSTD_CONST_CAST (char*, str) + 1;
+        if (std::string::npos == idx)
+            break;
     }
 
-    if (end == str + 2 || (ch < 0 || ch > UCHAR_MAX))
-        issue_diag (E_IFMT, true, &next,
-                    "unable to convert character %s\n", str);
-
-    if (pend)
-        *pend = end;
-
-    typedef unsigned char UChar;
-
-    return UChar (ch);
+    return count;
 }
 
-#endif
 
+/**************************************************************************/
 
-// find the number of bytes in the multibyte string by counting the
-// number of escape chars in the string
-size_t Charmap::mbcharlen (const std::string &str) const
-{
-    std::size_t count = 0;
-
-    std::size_t idx = str.find (scanner_.escape_char ());
-
-    for (; std::string::npos != idx; ++count) {
-        idx = str.find (scanner_.escape_char (), idx + 1);
-    }
-
-    return count;
-}
-
+const char* const Charmap::
+portable_charset[] = {
+    /* 0x00       */ "<NUL>",
+    /* 0x01   SOH */ 0,
+    /* 0x02   STX */ 0,
+    /* 0x03   ETX */ 0,
+    /* 0x04   EOT */ 0,
+    /* 0x05   ENQ */ 0,
+    /* 0x06   ACK */ 0,
+    /* 0x07   BEL */ "<alert>",
+    /* 0x08       */ "<backspace>",
+    /* 0x09   TAB */ "<tab>",
+    /* 0x0a       */ "<newline>", 
+    /* 0x0b       */ "<vertical-tab>",
+    /* 0x0c       */ "<form-feed>",
+    /* 0x0d       */ "<carriage-return>",
+    /* 0x0e   SO  */ 0,
+    /* 0x0f   SI  */ 0,
+    /* 0x10   DLE */ 0,
+    /* 0x11   DC1 */ 0,
+    /* 0x12   DC2 */ 0,
+    /* 0x13   DC3 */ 0,
+    /* 0x14   DC4 */ 0,
+    /* 0x15   NAK */ 0,
+    /* 0x16   SYN */ 0,
+    /* 0x17   ETB */ 0,
+    /* 0x18   CAN */ 0,
+    /* 0x19   EM  */ 0,
+    /* 0x1a   SUB */ 0,
+    /* 0x1b   ESC */ 0,
+    /* 0x1c   IS4 */ 0,
+    /* 0x1d   IS3 */ 0,
+    /* 0x1e   IS2 */ 0,
+    /* 0x1f   IS1 */ 0,
+    /* 0x20   SPC */ "<space>",
+    /* 0x21    !  */ "<exclamation-mark>",
+    /* 0x22    "  */ "<quotation-mark>",
+    /* 0x23    #  */ "<number-sign>", 
+    /* 0x24    $  */ "<dollar-sign>",
+    /* 0x25    %  */ "<percent-sign>",
+    /* 0x26    &  */ "<ampersand>",
+    /* 0x27    '  */ "<apostrophe>",
+    /* 0x28    (  */ "<left-parenthesis>",
+    /* 0x29    )  */ "<right-parenthesis>",
+    /* 0x2a    *  */ "<asterisk>", 
+    /* 0x2b    +  */ "<plus-sign>",
+    /* 0x2c    ,  */ "<comma>",
+    /* 0x2d    -  */ "<hyphen>",   // "<hyphen-minus>",
+    /* 0x2e    .  */ "<period>",   // "<full-stop>",
+    /* 0x2f    /  */ "<slash>",    // "<solidus>",
+    /* 0x30    0  */ "<zero>",
+    /* 0x31    1  */ "<one>",
+    /* 0x32    2  */ "<two>",
+    /* 0x33    3  */ "<three>",
+    /* 0x34    4  */ "<four>",
+    /* 0x35    5  */ "<five>",
+    /* 0x36    6  */ "<six>",
+    /* 0x37    7  */ "<seven>",
+    /* 0x38    8  */ "<eight>",
+    /* 0x39    9  */ "<nine>",
+    /* 0x3a    :  */ "<colon>",
+    /* 0x3b    ;  */ "<semicolon>",
+    /* 0x3c    <  */ "<less-than-sign>", 
+    /* 0x3d    =  */ "<equals-sign>",
+    /* 0x3e    >  */ "<greater-than-sign>",
+    /* 0x3f    ?  */ "<question-mark>",
+    /* 0x40    @  */ "<commercial-at>",
+    /* 0x41    A  */ "<A>",
+    /* 0x42    B  */ "<B>",
+    /* 0x43    C  */ "<C>",
+    /* 0x44    D  */ "<D>",
+    /* 0x45    E  */ "<E>",
+    /* 0x46    F  */ "<F>",
+    /* 0x47    G  */ "<G>",
+    /* 0x48    H  */ "<H>",
+    /* 0x49    I  */ "<I>",
+    /* 0x4a    J  */ "<J>",
+    /* 0x4b    K  */ "<K>",
+    /* 0x4c    L  */ "<L>",
+    /* 0x4d    M  */ "<M>",
+    /* 0x4e    N  */ "<N>",
+    /* 0x4f    O  */ "<O>",
+    /* 0x50    P  */ "<P>",
+    /* 0x51    Q  */ "<Q>",
+    /* 0x52    R  */ "<R>",
+    /* 0x53    S  */ "<S>",
+    /* 0x54    T  */ "<T>",
+    /* 0x55    U  */ "<U>",
+    /* 0x56    V  */ "<V>",
+    /* 0x57    W  */ "<W>",
+    /* 0x58    X  */ "<X>",
+    /* 0x59    Y  */ "<Y>",
+    /* 0x5a    Z  */ "<Z>",
+    /* 0x5b    [  */ "<left-square-bracket>",
+    /* 0x5c    \  */ "<backslash>",    // "<reverse-solidus>",
+    /* 0x5d    ]  */ "<right-square-bracket>",
+    /* 0x5e    ^  */ "<circumflex>",   // "<circumflex-accent>",
+    /* 0x5f    _  */ "<underscore>",   // "<low-line>",
+    /* 0x60    `  */ "<grave-accent>",
+    /* 0x61    a  */ "<a>",
+    /* 0x62    b  */ "<b>",
+    /* 0x63    c  */ "<c>",
+    /* 0x64    d  */ "<d>",
+    /* 0x65    e  */ "<e>",
+    /* 0x66    f  */ "<f>",
+    /* 0x67    g  */ "<g>",
+    /* 0x68    h  */ "<h>",
+    /* 0x69    i  */ "<i>",
+    /* 0x6a    j  */ "<j>",
+    /* 0x6b    k  */ "<k>",
+    /* 0x6c    l  */ "<l>",
+    /* 0x6d    m  */ "<m>",
+    /* 0x6e    n  */ "<n>",
+    /* 0x6f    o  */ "<o>",
+    /* 0x70    p  */ "<p>",
+    /* 0x71    q  */ "<q>",
+    /* 0x72    r  */ "<r>",
+    /* 0x73    s  */ "<s>",
+    /* 0x74    t  */ "<t>",
+    /* 0x75    u  */ "<u>",
+    /* 0x76    v  */ "<v>",
+    /* 0x77    w  */ "<w>",
+    /* 0x78    x  */ "<x>",
+    /* 0x79    y  */ "<y>",
+    /* 0x7a    z  */ "<z>",
+    /* 0x7b    {  */ "<left-brace>",    // "<left-curly-bracket>",
+    /* 0x7c    |  */ "<vertical-line>", 
+    /* 0x7d    }  */ "<right-brace>",   // "<right-curly-bracket>",
+    /* 0x7e    ~  */ "<tilde>",
+    /* 0x7f       */ 0
+};
 
 
 // convert a string of narrow character into a wchar_t
 bool Charmap::convert_to_wc (const std::string& sym_name,
                              const std::string& ext_enc, wchar_t& wc)
 {
-
 #ifndef _RWSTD_NO_ISO_10646_WCHAR_T
 
     // the internal wchar_t representation for all characters
@@ -302,17 +400,14 @@
 
 #else   // if defined _RWSTD_NO_ISO_10646_WCHAR_T
 
-    if (UCS4_internal_) {
+    if (UCS4_internal_ || Clocale_.empty ()) {
 
-        // translate the character to ISO-10646 (UCS)
+        // when using UCS as the internal encoding or for a locale
+        // that has no corresponding C library locale convert the
+        // character to ISO-10646 (UCS)
         return convert_to_ucs (sym_name, ext_enc, wc);
     }
 
-    // for a locale that has no corresponding C library locale
-    // convert the character to ISO-10646 (UCS)
-    if (Clocale_.empty ()) 
-        return convert_to_ucs (sym_name, ext_enc, wc);
-
     // otherwise use libc to convert the multi-byte character
     // to its wchar_t value
     if (-1 == std::mbtowc (&wc, ext_enc.c_str (), ext_enc.size ())) {
@@ -332,6 +427,7 @@
     return true;
 
 #endif   // _RWSTD_NO_ISO_10646_WCHAR_T
+
 }
 
 
@@ -377,153 +473,198 @@
 }
 
 
-wchar_t Charmap::increment_val (const wchar_t val) const
+wchar_t Charmap::increment_wchar (wchar_t val) const
 {
 #ifndef _RWSTD_NO_ISO_10646_WCHAR_T
+
     // to increment a wchar_t value and keep the encoding all we have
-    // to do is increment the val because the internal encoding is utf8
+    // to do is increment the val because the internal encoding is UCS
     return val + 1;
 
 #else
     // to increment a wchar_t value and keep the encoding we have to
     // convert the wchar_t to the external encoding, increment that
     // string value, and convert back to the internal representation
-    rn_cmap2_iter it = rn_cmap2_.find (val);
+    const rmb_cmap_iter it = rmb_cmap_.find (val);
+
+    if (it != rmb_cmap_.end ()) {
 
-    if (it != rn_cmap2_.end ()) {
+        mb_cmap_iter ret;
 
-        n_cmap2_iter ret;
+        // multibyte character corresponding to the wchar_t value
+        std::string encoding = it->second;
 
-        std::string mb_str = it->second;
         // continue incrementing the multi-byte value until we get a valid 
         // character.  NOTE: this must be done for encodings such as SJIS where
         // \x7f in the last byte of a multibyte string is not a valid character
         // NOTE: this will not detect errors in the sequence, since the program
         // will continue until it finds a valid character
         do {
-            int last_elm = mb_str.size() - 1;
+            int last_elm = encoding.size () - 1;
 
             while (last_elm >= 0) {
 
                 typedef unsigned char UChar;
 
-                const unsigned ic = UChar (mb_str [last_elm]) + 1;
+                const unsigned ic = UChar (encoding [last_elm]) + 1;
 
                 // if incrementing the last element caused it to exceed
-                // 0xff increment the next higher byte if there is one
-                if (ic > 0xff)
-                    mb_str [last_elm--] = 0;
+                // UCHAR_MAX increment the next higher byte if there is
+                // one
+                if (UCHAR_MAX < ic)
+                    encoding [last_elm--] = '\0';
                 else {
-                    mb_str [last_elm] = char (ic);
+                    encoding [last_elm] = char (ic);
                     break;
                 }
             }
 
             if (last_elm < 0)
-                std::cerr << "cannot convert character\n";
-        } while ((ret = n_cmap2_.find (mb_str)) == n_cmap2_.end ());
+                return -1;   // error
+
+        } while ((ret = mb_cmap_.find (encoding)) == mb_cmap_.end ());
 
         return ret->second;
     }
 
-    return -1;
+    return -1;   // error
 
 #endif   // _RWSTD_NO_ISO_10646_WCHAR_T    
+
 }
 
-const char* Charmap::increment_strval (const char* str)
+
+bool Charmap::
+increment_encoding (std::string &encoding)
 {
-    static char s  [64];
-    static char sd [64];
-    int         i = 0;
+    // find the last escape character in the human readable representation
+    // of the encoding (i.e., in the multibyte character such as "/xf0/x80")
+    const std::string::size_type pos =
+        encoding.rfind (scanner_.escape_char ());
+
+    // the escape character must be there (guaranteed by the scanner)
+    assert (pos < encoding.size ());
+
+    const char* end = 0;
+
+    // convert the last character in the multibyte character to a numeric
+    // value representing the last byte of the sequence
+    unsigned long last_byte =
+        scanner_.convert_escape (encoding.c_str () + pos, &end);
+
+    // POSIX requires that the incremented value be non-NUL
+    if (UCHAR_MAX <= last_byte || *end)
+        return false;
 
-    char* ps = s;
+    // increment the last byte
+    ++last_byte;
 
-    // zero the first element of string s
-    *ps = 0;
+    // format the last byte in the same notation (octal, decimal,
+    // or hexadecimal escape sequence)
+    static const char xdigits[] = "0123456789ABCDEF";
 
-    if (str == 0 || *str == 0)
-        return s;
+    char byte_str [5];
+    char *pdig = byte_str;
 
-    for (i = 0; *str && i < 64; i++) {
+    switch (encoding [pos + 1]) {
+    case 'd': {   // decimal escape
+        const unsigned hundreds = last_byte / 100;
+        const unsigned tens     = (last_byte / 10) % 10;
+        const unsigned units    = last_byte % 10;
 
-        *ps++ = scanner_.convert_escape (str, (const char**)&str);
+        *pdig++ = 'd';
+
+        if (hundreds)
+            *pdig++ = xdigits [hundreds];
+
+        *pdig++ = xdigits [tens];
+        *pdig++ = xdigits [units];
+        *pdig   = '\0';
+        break;
     }
 
-    *ps = 0;
+    case 'x': {   // hex escape
+        const unsigned hi = last_byte >> 4;
+        const unsigned lo = last_byte & 0xfU;
 
-    //  now  attempt to  increment the  last character  in the  string  if the
-    //  character gets incremented above  /xff then we increment the next char
-    if (ps == s) {
-        return s;
+        *pdig++ = 'x';
+        *pdig++ = xdigits [hi];
+        *pdig++ = xdigits [lo];
+        *pdig   = '\0';
+        break;
     }
+    default: {   // octal escape
+        const unsigned hi  = last_byte >> 6;
+        const unsigned mid = (last_byte >> 3) & 07U;
+        const unsigned lo  = last_byte & 07U;
 
-    for (ps--; ps >= s; ps--) 
-        if ((unsigned char)(*ps) == 0xff) {
-            *ps = 0;
-            continue;
-        } else {
-            (*ps)++;
-            break;
-        }
+        if (hi)
+            *pdig++ = xdigits [hi];
 
-    const char fmt [] = {scanner_.escape_char (), 
-                         'x', '%', '0', '2', 'x', '\0'};
-    ps = s;
-    for (i = 0; *ps; i += 4, ps++)
-        std::sprintf (&sd [i], fmt, *(unsigned char*)ps);
+        *pdig++ = xdigits [mid];
+        *pdig++ = xdigits [lo];
+        *pdig   = '\0';
+    }
+    }   // switch
+
+    // replace the last escape sequence with the new one
+    encoding.replace (pos + 1, std::string::npos, byte_str);
 
-    return sd;
+    return true;
 }
 
 
-std::string Charmap::parse_ext_strval (const std::string &strval) const
+std::string Charmap::
+encoding_to_mbchar (const std::string &encoding) const
 {
-    std::string ext_enc;
+    std::string mbchar;
 
-    for (const char *str = strval.c_str (); str && *str; )
-        ext_enc += char (scanner_.convert_escape (str, &str));
+    for (const char *pbyte = encoding.c_str (); pbyte && *pbyte; )
+        mbchar += char (scanner_.convert_escape (pbyte, &pbyte));
 
-    return ext_enc;
+    return mbchar;
 }
 
 
 // convert the locale's encoded character to UCS4 wchar_t
-wchar_t Charmap::convert_sym_to_ucs (const std::string& s) const
+wchar_t Charmap::
+convert_sym_to_ucs (const std::string &sym) const
 {
-    std::string::const_iterator it (s.begin ());
+    std::string::const_iterator it (sym.begin ());
 
-    if (s.size () < 4 || *it != '<' || *++it != 'U') {
-        issue_diag (E_UCS, true, 0, 
-                    "Attempt to convert symbolic name to UCS value failed. "
-                    "Name %s not in <Uxxxxxxxx> form.\n", s.c_str ());
+    if (   sym.size () < 4 || *it != '<' || *++it != 'U'
+        || !isxdigit (*++it)) {
+        issue_diag (E_UCS, true, 0,
+                    "Unable to convert symbolic name %s to UCS.\n",
+                    sym.c_str ());
     }
 
-    long w = std::strtol (&*++it, (char**)0, 16);
-    if (w == _RWSTD_LONG_MIN || w == _RWSTD_LONG_MAX ||
-        w >  _RWSTD_WCHAR_T_MAX )
+    const unsigned long val = std::strtoul (&*it, (char**)0, 16);
+
+    if (_RWSTD_WCHAR_T_MAX <= val)
         issue_diag (E_UCS, true, 0, 
-                    "Attempt to convert symbolic name to UCS value failed. "
-                    "Value of %s out of range.\n", s.c_str ());
+                    "UCS value %lu of symbolic character %s out of range.\n",
+                    val, sym.c_str ());
 
-    return wchar_t (w);
+    return wchar_t (val);
 }
 
+
 // convert the locale's encoded character to UCS4/UCS2 wchar_t
 bool Charmap::convert_to_ucs (const std::string &sym_name, 
-                              const std::string &ext_enc, wchar_t& wc)
+                              const std::string &encoding, wchar_t& wc)
 {
 #ifndef _MSC_VER
 
     if (in_utf8_) {
-        wc = utf8_decode (ext_enc.c_str (), &*(ext_enc.end () - 1));
+        wc = utf8_decode (encoding.c_str (), &*(encoding.end () - 1));
         return true;
     }
 
     char utf8_enc [_RWSTD_MB_LEN_MAX + 1];
 
     const char* const ch_end =
-        convert_to_utf8  (ext_enc.c_str (), ext_enc.size (),
+        convert_to_utf8  (encoding.c_str (), encoding.size (),
                           utf8_enc, sizeof utf8_enc);
     if (ch_end)
         // only if conversion to utf8 succeeded
@@ -538,7 +679,7 @@
 
     if (0 != codepage_) {
         wchar_t ret[2] = {0};
-        MultiByteToWideChar (codepage_, 0, ext_enc.c_str(), -1, ret, 2);
+        MultiByteToWideChar (codepage_, 0, encoding.c_str(), -1, ret, 2);
         if (ret[1] != 0)
             return false;
 
@@ -556,16 +697,27 @@
 
 
 void Charmap::add_to_cmaps (const std::string &sym_name, 
-                            const std::string &strval)
+                            const std::string &encoding,
+                            bool               is_mbchar /* = false */)
 {
-    strval_map_.insert (std::make_pair (strval, sym_name));
+    // compute the external (multibyte) encoding of the character
+    // if necessary (i.e., unless already done by the caller)
+    const std::string mbchar =
+        is_mbchar ? encoding : encoding_to_mbchar (encoding);
+
+    symnames_list_.push_back (sym_name);
 
-    if (mbcharlen (strval) == 1) {
+    if (1 == mbchar.size ()) {
+        // mbchar is a single-byte character
 
-        const unsigned char ch = scanner_.convert_escape (strval.c_str ());
+        const unsigned char ch = mbchar [0];
 
-        if (forward_maps)
+        // add the narrow character and its symbolic name to the narrow
+        // character maps
+        if (forward_maps) {
+            // the locale utility doesn't need forward maps
             n_cmap_.insert (std::make_pair (sym_name, ch));
+        }
 
         if (reverse_maps)
             rn_cmap_.insert (std::make_pair (ch, sym_name));
@@ -574,176 +726,204 @@
             largest_nchar_ = ch;
     }
 
-    const std::string ext_enc = parse_ext_strval (strval);
-
-    wchar_t ch;
+    // (try to) compute the wide character value of the character
+    wchar_t wch;
 
-    if (convert_to_wc (sym_name, ext_enc, ch)) {
+    if (convert_to_wc (sym_name, mbchar, wch)) {
 
-        if (forward_maps)
-            w_cmap_.insert (std::make_pair (sym_name, ch));
+        // add the wide character and its symbolic name to the wide
+        // character maps
+        if (forward_maps) {
+            // the locale utility doesn't need forward maps
+            w_cmap_.insert (std::make_pair (sym_name, wch));
+        }
 
         if (reverse_maps)
-            rw_cmap_.insert (std::make_pair (ch, sym_name));
-
-        std::string n_strval = parse_ext_strval (strval);
-
-        n_cmap2_.insert (std::make_pair (n_strval, ch));
-        rn_cmap2_.insert (std::make_pair (ch, n_strval));
-
-        assert (n_strval.size () != 0);
+            rw_cmap_.insert (std::make_pair (wch, sym_name));
 
-        for (std::string::size_type i = n_strval.size (); --i; )
-            valid_mb_set_.insert (n_strval.substr (0, i));
+        // add the corresponding multibyte character to the multibyte
+        // character maps
+        mb_cmap_.insert (std::make_pair (mbchar, wch));
+        rmb_cmap_.insert (std::make_pair (wch, mbchar));
     }
 
-    wchar_t ucs_val;
+    // compute the UCS value of the character
+    wchar_t uch;
 
-    if (convert_to_ucs (sym_name, ext_enc, ucs_val)) {
-        ucs4_cmap_.insert (std::make_pair (sym_name, ucs_val));
-        rucs4_cmap_.insert (std::make_pair (ucs_val, sym_name));
+    if (convert_to_ucs (sym_name, mbchar, uch)) {
+
+        // add UCS character and its symbolic name to the UCS
+        // character maps
+        ucs4_cmap_.insert (std::make_pair (sym_name, uch));
+        rucs4_cmap_.insert (std::make_pair (uch, sym_name));
     }
 }
 
 
 // process the characters implicitly defined by using ellipsis between
 // two explicitly defined characters
-void Charmap::
+std::size_t Charmap::
 process_ellipsis (const Scanner::token_t &beg_tok, int num_ellipsis)
 {
+    // get the upper end of the range denoted by the ellipsis
     const Scanner::token_t end_tok = scanner_.next_token ();
-    std::string strval = scanner_.next_token ().name;
 
-    // first add the beg_tok symbol name to the maps
-    add_to_cmaps (beg_tok.name, strval);
+    // get the human readable encoding of the character
+    // denoted by the lower end of the ellipsis
+    const std::string encoding = scanner_.next_token ().name;
+
+    // convert the encoding to a multibyte character
+    std::string mbchar = encoding_to_mbchar (encoding);
+
+    // add the beg_tok symbol name to the maps
+    add_to_cmaps (beg_tok.name, mbchar, true);
     
-    // seperate the numeric portion of the symbolic name from the
-    // character portion in order to dynamically create symbolic
-    // names with increasing numeric values
-    std::string begin;
-    size_t idx = 0;
+    // extract the numeric portion of the symbolic character name
+    // denoted by the lower end of the ellipsis
+    std::size_t idx = 0;
 
     int base;           // numeric base
     const char *fmat;   // sprintf() format specifier
 
+    const std::size_t beg_len = beg_tok.name.size ();
+
     // determine the value of the beginning of the range
     // denoted by the ellipsis
-    if (num_ellipsis == 2) {
+    if (2 == num_ellipsis) {
         base = 16;
-        fmat = "%s%0*lX>";
+        fmat = "%.*s%0*lX>";
 
         // advance to the first hex digit
-        while (   idx < beg_tok.name.size ()
-               && !(std::isxdigit)(beg_tok.name [idx]))
-            begin += beg_tok.name [idx++];
+        while (idx < beg_len && !(std::isxdigit)(beg_tok.name [idx]))
+            ++idx;
     }
     else {
         base = 10;
-        fmat = "%s%0*ld>";
+        fmat = "%.*s%0*ld>";
 
         // advance to the first decimal digit
-        while (   idx < beg_tok.name.size ()
-               && !(std::isdigit)(beg_tok.name [idx]))
-            begin += beg_tok.name [idx++];
+        while (idx < beg_len && !(std::isdigit)(beg_tok.name [idx]))
+            ++idx;
     }
 
-    std::string num_str;  // the numeric portion of the sym name
-
-    // get the numeric portion of the sym_name, this is the portion
-    // that will be different for each sym_name within the ellipsis
-    while (idx < beg_tok.name.size () && beg_tok.name [idx] != '>')
-        num_str += beg_tok.name [idx++];
-
-    const int num_size = int (num_str.size ());
-    
-    // convert the num_str to a long
-    unsigned long start_num = std::strtoul (num_str.c_str(), (char**)0, base);
+    // length of non-numeric prefix of the symbolic character name
+    const std::size_t pfx_len = idx;
 
-    // increment the start_num once because we already added the 
-    // beg_tok symbol name to the cmaps
-    ++start_num;
+    // get the character value plus one (since the first value
+    // has already been added to the map earlier)
+    char *num_end;
+    const unsigned long beg_val =
+        1 + std::strtoul (beg_tok.name.c_str () + pfx_len, &num_end, base);
+
+    // the length of the numeric portion
+    const std::size_t num_size =
+        num_end - (beg_tok.name.c_str () + pfx_len);
 
     // find the end of the range denoted by the ellipsis
     idx = 0;
 
-    if (num_ellipsis == 2) {
+    const std::size_t end_len = end_tok.name.size ();
+
+    if (2 == num_ellipsis) {
         // advance to the next hex digit
-        while (   idx < end_tok.name.size ()
-               && !(std::isxdigit)(end_tok.name [idx]))
+        while (idx < end_len && !(std::isxdigit)(end_tok.name [idx]))
             ++idx;
     }
     else {
         // advance to the next dec digit
-        while (   idx < end_tok.name.size ()
-               && (std::isdigit)(end_tok.name [idx]))
+        while (idx < end_len && !(std::isdigit)(end_tok.name [idx]))
             ++idx;
     }
 
-    num_str.clear ();
-
-    // advance to the closing bracket ('>')
-    while (idx < end_tok.name.size() && end_tok.name[idx] != '>')
-        num_str += end_tok.name [idx++];
-    
-    unsigned long end_num = std::strtoul (num_str.c_str(), (char**)0, base);
+    const unsigned long end_val =
+        std::strtoul (end_tok.name.c_str () + idx, (char**)0, base);
 
-    // the ending numeric value should be greater then the start numeric value
-    if (end_num < start_num)
+    // the ending numeric value must be greater than or equal
+    // to the beginning numeric value
+    if (end_val < beg_val)
         issue_diag (E_RANGE, true, &end_tok, 
                     "invalid range found in character map file\n");
     
     char next_name [MAX_SYM_NAME_LEN];
 
-    for (; start_num <= end_num; ++start_num) {
+    std::size_t nchars = 0;
 
-        std::sprintf (next_name, fmat, begin.c_str (), num_size, start_num);
-        
-        // increment the string value to the next encoded character value
-        strval = increment_strval (strval.c_str ());
+    const char* const pfx = beg_tok.name.c_str ();
+
+    for (unsigned long val = beg_val; val <= end_val; ++val, ++nchars) {
 
-        // and finally add the generated name and string value to the maps
-        add_to_cmaps (next_name, strval);
+        std::sprintf (next_name, fmat, int (pfx_len), pfx, int (num_size), val);
+        
+        // increment the last byte of the multibyte character
+        // and if the result is valid (i.e., doesn't contain
+        // an embedded NUL) add the generated name and the
+        // multibyte character to the maps
+        const unsigned char last_byte = mbchar [mbchar.size () - 1];
+        if (last_byte < UCHAR_MAX) {
+            mbchar [mbchar.size () - 1] = last_byte + 1;
+            add_to_cmaps (next_name, mbchar, true);
+        }
+        else {
+            // an ellipsis must not specify a range that includes
+            // an encoding with an embedded NUL
+            issue_diag (E_RANGE, true, &beg_tok, 
+                        "encoding of an element in range contains NUL\n");
+        }
     }
+
+    // return the number of characters denoted by the ellipsis
+    return nchars;
 }
 
 
 // process all the characters in the character map file.
 void Charmap::process_chars()
 {
+    issue_diag (I_STAGE, false, 0, "processing CHARMAP section\n");
+
+    std::size_t ntokens = 0;
+    std::size_t nellips = 0;
+    std::size_t nchars  = 0;
+
     next = scanner_.next_token();
     Scanner::token_t nextnext;
 
     // loop until we find the closing charmap token
-    while (next.token != Scanner::tok_charmap) {
+    for ( ; next.token != Scanner::tok_charmap; ++ntokens) {
 
-        switch (next.token){
+        switch (next.token) {
 
         case Scanner::tok_nl:
         case Scanner::tok_end:
             break;
 
         case Scanner::tok_sym_name:
-            // the next token may be either ellipsis if this line of the
-            // charmap is in the form: 
+            // the next token may be either ellipsis if this line
+            // of the charmap is in the form:
             // "%s...%s %s\n", <sym_name>, <sym_name>, <encoding>
             // or an encoding if this line is in the format:
             // "%s %s\n", <sym_name>, <encoding>
-            nextnext = scanner_.next_token();
+            nextnext = scanner_.next_token ();
+            ntokens += 3;
 
             switch (nextnext.token) {
 
-            case Scanner::tok_ellipsis:
-                process_ellipsis (next, 3);
+            case Scanner::tok_abs_ellipsis:
+                // absolute ellipsis (see ISO/IEC TR 14652)
+                nchars += process_ellipsis (next, 3);
+                ++nellips;
                 break;
 
-            case Scanner::tok_dellipsis:
-                process_ellipsis (next, 2);
+            case Scanner::tok_hex_ellipsis:
+                // hexadecimal symbolic ellipsis (see ISO/IEC TR 14652)
+                nchars += process_ellipsis (next, 2);
+                ++nellips;
                 break;
 
-            case Scanner::tok_decimal_value:
-            case Scanner::tok_hex_value:
-            case Scanner::tok_octal_value:
+            case Scanner::tok_char_value:
+                // character represented as a numeric constant
                 add_to_cmaps (next.name, nextnext.name);
+                ++nchars;
                 break;
 
             default:
@@ -752,7 +932,7 @@
                             "name in character map file\n");
             }
 
-            scanner_.ignore_line();
+            scanner_.ignore_line ();
             break;
 
         default:
@@ -762,9 +942,13 @@
         }
 
         next = scanner_.next_token();
-
     }
-    
+
+    issue_diag (I_STAGE, false, 0,
+                "done processing CHARMAP section (%lu tokens, "
+                "%lu ellipses, %lu characters)\n",
+                (unsigned long)ntokens, (unsigned long)nellips, (unsigned long)nchars);
+
     // make sure that all characters in the portable character set
     // are in the charmap
     if (forward_maps)
@@ -772,37 +956,20 @@
 }
 
 
-void Charmap::verify_portable_charset()
+void Charmap::verify_portable_charset () const
 {
-    static const char* const charset[] = {
-        "<NUL>","<alert>", "<backspace>", "<tab>", "<newline>", 
-        "<vertical-tab>", "<form-feed>", "<carriage-return>", "<space>",
-        "<exclamation-mark>", "<quotation-mark>", "<number-sign>", 
-        "<dollar-sign>", "<percent-sign>", "<ampersand>", "<apostrophe>",
-        "<left-parenthesis>", "<right-parenthesis>", "<asterisk>", 
-        "<plus-sign>", "<comma>", "<hyphen>", "<hyphen-minus>", "<period>",
-        "<full-stop>", "<slash>", "<solidus>", "<zero>", "<one>", "<two>",
-        "<three>", "<four>", "<five>", "<six>", "<seven>", "<eight>",
-        "<nine>", "<colon>", "<semicolon>", "<less-than-sign>", 
-        "<equals-sign>", "<greater-than-sign>", "<question-mark>",
-        "<commercial-at>", "<A>", "<B>", "<C>", "<D>", "<E>", "<F>", "<G>",
-        "<H>", "<I>", "<J>", "<K>", "<L>","<M>", "<N>", "<O>", "<P>","<Q>",
-        "<R>", "<S>", "<T>", "<U>", "<V>", "<W>", "<X>", "<Y>", "<Z>",
-        "<left-square-bracket>", "<backslash>", "<reverse-solidus>",
-        "<right-square-bracket>", "<circumflex>", "<circumflex-accent>",
-        "<underscore>", "<low-line>", "<grave-accent>",
-        "<a>", "<b>", "<c>", "<d>", "<e>", "<f>", "<g>",
-        "<h>", "<i>", "<j>", "<k>", "<l>","<m>", "<n>", "<o>", "<p>","<q>",
-        "<r>", "<s>", "<t>", "<u>", "<v>", "<w>", "<x>", "<y>", "<z>",
-        "<left-brace>", "<left-curly-bracket>", "<vertical-line>", 
-        "<right-brace>", "<right-curly-bracket>", "<tilde>"
-    };
+    const std::size_t nchars =
+        sizeof portable_charset / sizeof *portable_charset;
+
+    for (std::size_t i = 0; i < nchars; ++i) {
+        if (0 == portable_charset [i])
+            continue;
 
-    for (std::size_t i = 0; i < sizeof charset / sizeof (char*); ++i)
-        if (n_cmap_.find (charset [i]) == n_cmap_.end ())
+        if (n_cmap_.find (portable_charset [i]) == n_cmap_.end ())
             issue_diag (W_NOPCS, false, 0, 
                         "member of portable character set %s not found "
-                        "in the character map\n", charset[i]);
+                        "in the character map\n", portable_charset [i]);
+    }
 }
 
 

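The range expansion done by process_ellipsis() above can be sketched in isolation as follows. This is a minimal, hypothetical illustration rather than code from the commit: expand_hex_range, its parameters, and the 64-byte name buffer are assumptions made for the example, and it handles only the hexadecimal form of the ellipsis. It shows the two steps the function combines: rebuilding each symbolic name from the non-numeric prefix plus a zero-padded number, and incrementing the last byte of the multibyte character for each generated name.

```cpp
#include <cassert>
#include <cctype>
#include <climits>
#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper (not part of stdcxx): expands a range such as
// <U0041>..<U0043> whose first element is encoded as mbchar, and
// returns (symbolic name, multibyte character) pairs for every
// element of the range, including the first one.
std::vector<std::pair<std::string, std::string>>
expand_hex_range (const std::string &beg_name,
                  const std::string &end_name,
                  std::string        mbchar)
{
    assert (!mbchar.empty ());

    // length of the non-numeric prefix of the first name, e.g., "<U"
    std::size_t pfx_len = 0;
    while (   pfx_len < beg_name.size ()
           && !std::isxdigit ((unsigned char)beg_name [pfx_len]))
        ++pfx_len;

    // value and width of the numeric portion of the first name
    const char* const num_beg = beg_name.c_str () + pfx_len;
    char *num_end = nullptr;
    const unsigned long beg_val = std::strtoul (num_beg, &num_end, 16);
    const int num_width = int (num_end - num_beg);

    // value of the numeric portion of the last name
    std::size_t idx = 0;
    while (   idx < end_name.size ()
           && !std::isxdigit ((unsigned char)end_name [idx]))
        ++idx;
    const unsigned long end_val =
        std::strtoul (end_name.c_str () + idx, nullptr, 16);

    std::vector<std::pair<std::string, std::string>> result;
    result.push_back (std::make_pair (beg_name, mbchar));

    for (unsigned long val = beg_val + 1; val <= end_val; ++val) {
        const unsigned char last = mbchar [mbchar.size () - 1];
        if (UCHAR_MAX == last)
            break;   // incrementing would wrap around to NUL

        mbchar [mbchar.size () - 1] = char (last + 1);

        // both '*' arguments must have type int
        char name [64];
        std::snprintf (name, sizeof name, "%.*s%0*lX>",
                       int (pfx_len), beg_name.c_str (), num_width, val);

        result.push_back (std::make_pair (std::string (name), mbchar));
    }

    return result;
}
```

Called as expand_hex_range ("<U0041>", "<U0043>", "A"), the sketch produces three pairs, one for each name in the range. Unlike the committed code, it stops silently instead of issuing a diagnostic when the last byte cannot be incremented.
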
Modified: incubator/stdcxx/trunk/util/charmap.h
URL: http://svn.apache.org/viewvc/incubator/stdcxx/trunk/util/charmap.h?view=diff&rev=448754&r1=448753&r2=448754
==============================================================================
--- incubator/stdcxx/trunk/util/charmap.h (original)
+++ incubator/stdcxx/trunk/util/charmap.h Thu Sep 21 17:42:16 2006
@@ -2,20 +2,27 @@
  *
  * charmap.h
  *
- * $Id: //stdlib/dev/source/stdlib/util/charmap.h#41 $
+ * $Id$
  *
  ***************************************************************************
  *
- * Copyright (c) 1994-2005 Quovadx,  Inc., acting through its  Rogue Wave
- * Software division. Licensed under the Apache License, Version 2.0 (the
- * "License");  you may  not use this file except  in compliance with the
- * License.    You    may   obtain   a   copy   of    the   License    at
- * http://www.apache.org/licenses/LICENSE-2.0.    Unless   required    by
- * applicable law  or agreed to  in writing,  software  distributed under
- * the License is distributed on an "AS IS" BASIS,  WITHOUT WARRANTIES OR
- * CONDITIONS OF  ANY KIND, either  express or implied.  See  the License
- * for the specific language governing permissions  and limitations under
- * the License.
+ * Licensed to the Apache Software  Foundation (ASF) under one or more
+ * contributor  license agreements.  See  the NOTICE  file distributed
+ * with  this  work  for  additional information  regarding  copyright
+ * ownership.   The ASF  licenses this  file to  you under  the Apache
+ * License, Version  2.0 (the  "License"); you may  not use  this file
+ * except in  compliance with the License.   You may obtain  a copy of
+ * the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the  License is distributed on an  "AS IS" BASIS,
+ * WITHOUT  WARRANTIES OR CONDITIONS  OF ANY  KIND, either  express or
+ * implied.   See  the License  for  the  specific language  governing
+ * permissions and limitations under the License.
+ *
+ * Copyright 2001-2006 Rogue Wave Software.
  * 
  **************************************************************************/
 
@@ -24,6 +31,7 @@
 #define _RWSTD_LOC_CHARMAP_H_INCLUDED
 
 #include <string>
+#include <list>
 #include <map>
 #include <set>
 
@@ -36,6 +44,8 @@
 class Charmap
 {
 public:
+    static const char* const portable_charset [];
+
     Charmap(const char* /*corresponding C library locale*/,
             const char* /*filename*/, 
             bool /*is utf8 encoding?*/, 
@@ -43,37 +53,45 @@
             bool /*create_reverse_charmaps*/, 
             bool /*use UCS4 internally*/);    
   
-    // get the narrow character map
-    const std::map<std::string, unsigned char >& get_n_cmap() const {
+    // returns the narrow character map which maps a symbolic character
+    // name to its narrow character value
+    const std::map<std::string, unsigned char>& get_n_cmap() const {
         return n_cmap_;
     }
     
-    // get the reverse narrow character map
-    const std::map<unsigned char, std::string >& get_rn_cmap() const {
+    // returns the reverse narrow character map which maps a narrow
+    // character value to its symbolic name
+    const std::map<unsigned char, std::string>& get_rn_cmap() const {
         return rn_cmap_;
     }
 
-    // get the wide character map
-    const std::map<std::string, wchar_t >& get_w_cmap() const {
+    // returns the wide character map which maps a symbolic character
+    // name to its wide character value
+    const std::map<std::string, wchar_t>& get_w_cmap() const {
         return w_cmap_;
     }
 
-    // get the reverse wide character map
-    const std::map<wchar_t, std::string >& get_rw_cmap() const {
+    // returns the reverse wide character map which maps a wide
+    // character value to its symbolic name
+    const std::map<wchar_t, std::string>& get_rw_cmap() const {
         return rw_cmap_;
     }
 
-    const std::map<std::string, wchar_t>& get_n_cmap2() const {
-        return n_cmap2_;
+    // returns the multibyte character map which maps a multibyte
+    // character to its corresponding wide character value
+    const std::map<std::string, wchar_t>& get_mb_cmap() const {
+        return mb_cmap_;
     }
 
-    const std::map<wchar_t, std::string>& get_rn_cmap2() const {
-        return rn_cmap2_;
+    // returns the reverse multibyte character map which maps a wide
+    // character value to its corresponding multibyte character
+    const std::map<wchar_t, std::string>& get_rmb_cmap() const {
+        return rmb_cmap_;
     }
 
     // get the string value map
-    const std::map<std::string, std::string>& get_strval_map() const {
-        return strval_map_;
+    const std::list<std::string>& get_symnames_list() const {
+        return symnames_list_;
     }
 
     const std::map <std::string, wchar_t>& get_ucs4_cmap () const {
@@ -84,14 +102,6 @@
         return rucs4_cmap_;
     }
 
-    const std::set<std::string>& get_valid_mb_set() const {
-        return valid_mb_set_;
-    }
-
-    const std::set<std::string>& get_valid_wchar_set() const {
-        return valid_wchar_set_;
-    }
-
     // return the value of mb_cur_max
     int get_mb_cur_max() const {
         return mb_cur_max_;
@@ -128,32 +138,36 @@
 
     unsigned char get_largest_nchar () const;
     
-    // increment the wide character value to the next encoded character in
-    // this codeset
-    wchar_t increment_val (const wchar_t) const;
+    // increments the wide character value to the next encoded character
+    // in the current codeset; returns the incremented value or -1 on
+    // error
+    wchar_t increment_wchar (wchar_t) const;
 
 private:
     
-    // process the characters implicitly defined by using ellipsis between
-    // two explicitly defined characters
-    void process_ellipsis (const Scanner::token_t&, int);
+    // processes characters implicitly defined by an ellipsis denoted
+    // by two explicitly defined characters; returns the number of
+    // characters in the range, -1 on error
+    std::size_t process_ellipsis (const Scanner::token_t&, int);
 
     // process the charmap file making the necessary mappings in the cmaps
     void process_chars();
 
-    // increment the multi-byte string by 1.
-    const char* increment_strval (const char*);
+    // increment the encoded multi byte character argument
+    bool increment_encoding (std::string&);
 
-    // make sure that all the characters in the portable character set are
-    // defined in the character map
-    void verify_portable_charset ();
+    // verify that all the characters in the portable character set
+    // are defined in the character map
+    void verify_portable_charset () const;
 
 #ifndef _MSC_VER
     // open the iconv descriptor to convert to utf8
     iconv_t open_iconv_to_utf8 () const;
 #endif  // _MSC_VER
 
-    std::string parse_ext_strval (const std::string&) const;
+    // convert a human-readable encoding of a character
+    // to its raw multibyte character representation
+    std::string encoding_to_mbchar (const std::string&) const;
 
     // convert a multi-byte string to a utf8 multi-byte string
     char* convert_to_utf8 (const char *inbuf, std::size_t inbuf_s, 
@@ -167,8 +181,11 @@
 #  endif   // _RWSTD_NO_ISO_10646_WCHAR_T
 #endif  // _MSC_VER
     
-    // add the sym_name and multi-byte character to the character maps
-    void add_to_cmaps (const std::string&, const std::string&);
+    // add the symbolic name of a character and the raw multibyte
+    // character corresponding to it to the character maps
+    void add_to_cmaps (const std::string&,
+                       const std::string&,
+                       bool = false);
         
     // the scanner used to process the charmap file
     Scanner scanner_;
@@ -181,16 +198,18 @@
 #endif // _MSC_VER
 
     // n_cmap maps the symbolic name to a narrow character value
-    // rn_cmap does exactly the opposite
+    // rn_cmap does the opposite
     std::map <std::string, unsigned char> n_cmap_;
     std::map <unsigned char, std::string> rn_cmap_;
-    std::map <std::string, wchar_t> n_cmap2_;
-    std::map <wchar_t, std::string> rn_cmap2_;
-    typedef std::map <wchar_t, std::string>::const_iterator rn_cmap2_iter;
-    typedef std::map <std::string, wchar_t>::const_iterator n_cmap2_iter;
 
-    std::set<std::string> valid_mb_set_;
-    std::set<std::string> valid_wchar_set_;
+    // mb_cmap maps a multibyte character representation to its
+    // corresponding wide character value
+    // rmb_cmap does the opposite
+    std::map <std::string, wchar_t> mb_cmap_;
+    std::map <wchar_t, std::string> rmb_cmap_;
+
+    typedef std::map <wchar_t, std::string>::const_iterator rmb_cmap_iter;
+    typedef std::map <std::string, wchar_t>::const_iterator mb_cmap_iter;
 
     // w_cmap maps the symbolic name to a wide character value
     // rw_cmap does exactly the opposite 
@@ -232,11 +251,8 @@
     // should we use UCS4 as the internal representation
     bool UCS4_internal_;
 
-    // maps the string value to the symbolic name
-    // this map is required for the UNDEFINED keyword
-    // in localedef where the elements must be added in
-    // increasing encoded order.
-    std::map<std::string, std::string> strval_map_;
+    // list of all known symbolic character names
+    std::list<std::string> symnames_list_;
 
     Scanner::token_t next;
 };
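
The relationship between the renamed mb_cmap_/rmb_cmap_ members and the new symnames_list_ can be illustrated with a small stand-alone sketch. MiniCharmap and its add() member are hypothetical names invented for this example; the real Charmap class computes the wide character itself in add_to_cmaps() and also maintains the narrow, wide, and UCS maps.

```cpp
#include <cassert>
#include <list>
#include <map>
#include <string>
#include <utility>

// Hypothetical, simplified stand-in for the Charmap members above.
struct MiniCharmap {
    // multibyte character -> wide character value
    std::map<std::string, wchar_t> mb_cmap;

    // wide character value -> multibyte character
    std::map<wchar_t, std::string> rmb_cmap;

    // all symbolic character names in the order they were added
    std::list<std::string> symnames;

    // records one character; the caller supplies the wide value
    void add (const std::string &sym_name,
              const std::string &mbchar,
              wchar_t            wch)
    {
        assert (!mbchar.empty ());

        symnames.push_back (sym_name);

        // std::map::insert() keeps the existing entry when the key
        // is already present, so the first mapping wins
        mb_cmap.insert (std::make_pair (mbchar, wch));
        rmb_cmap.insert (std::make_pair (wch, mbchar));
    }
};
```

After adding "<A>" and "<B>" with their single-byte encodings, a lookup in mb_cmap by encoding and a lookup in rmb_cmap by wide value return each other's keys, while symnames preserves insertion order, which a std::map keyed on the name would not.
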
