* Mark Davis
|
| The HTML spec depends on the SGML spec for a characterization of
| allowable characters. The latter, unfortunately, disallows some
| valid Unicode characters (most C0 controls), but inconsistently
| allows other similar characters (C1 controls). 

SGML is silent on the issue of what characters are allowed. It is the
SGML declaration used by each application which decides this, and you
can easily make an SGML declaration which allows every Unicode
character.

To wit:

<!SGML  "ISO 8879:1986 (WWW)"
     CHARSET
          BASESET  "ISO Registration Number 177//CHARSET
                    ISO/IEC 10646-1:1993 UCS-4 with
                    implementation level 3//ESC 2/5 2/15 4/6"
          DESCSET 0       55296   0
                  55296   2048    UNUSED  -- SURROGATES --
                  57344   1056768 57344

CAPACITY        SGMLREF
                TOTALCAP        150000
                GRPCAP          150000
                ENTCAP          150000 

SCOPE    DOCUMENT
SYNTAX
         SHUNCHAR NONE
         BASESET  "ISO 646IRV:1991//CHARSET
                   International Reference Version
                   (IRV)//ESC 2/8 4/2"
         DESCSET  0 128 0          FUNCTION
                  RE            13
                  RS            10
                  SPACE         32
                  TAB SEPCHAR    9          

         NAMING   LCNMSTRT ""
                  UCNMSTRT ""
                  LCNMCHAR ".-_:"   
                  UCNMCHAR ".-_:"
                  NAMECASE GENERAL YES
                           ENTITY  NO

         DELIM    GENERAL  SGMLREF
                  HCRO "&#38;#x"   -- 38 is the number for ampersand --
                  SHORTREF SGMLREF
         NAMES    SGMLREF
         QUANTITY SGMLREF
                  ATTCNT   60      -- increased --
                  ATTSPLEN 65536   -- These are the largest values --
                  LITLEN   65536   -- permitted in the declaration --
                  NAMELEN  65536   -- Avoid fixed limits in actual --
                  PILEN    65536   -- implementations of HTML UA's --
                  TAGLVL   100
                  TAGLEN   65536
                  GRPGTCNT 150
                  GRPCNT   64 

FEATURES
  MINIMIZE
    DATATAG  NO
    OMITTAG  YES
    RANK     NO
    SHORTTAG YES
  LINK
    SIMPLE   NO
    IMPLICIT NO
    EXPLICIT NO
  OTHER
    CONCUR   NO
    SUBDOC   NO
    FORMAL   YES
  APPINFO NONE
>
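
The numbers in the CHARSET section can be checked with a short Python
sketch. It is only an illustration: the DESCSET table is copied from the
declaration above, and the function names (allowed, rows_are_contiguous,
end_of_rows) are made up here and are not part of any SGML tool.

```python
# Each DESCSET row from the declaration above:
# (first described number, how many numbers, base-set number or None for UNUSED)
DESCSET = [
    (0,     55296,   0),
    (55296, 2048,    None),    # UNUSED: the surrogate block
    (57344, 1056768, 57344),
]


def allowed(code_point, rows=DESCSET):
    """Return True if the declaration maps code_point to a base-set character."""
    for start, count, base in rows:
        if start <= code_point < start + count:
            return base is not None
    return False  # outside every described range


def rows_are_contiguous(rows=DESCSET):
    """Return True if the rows start at 0 and leave no gaps or overlaps."""
    expected = 0
    for start, count, _ in rows:
        if start != expected:
            return False
        expected = start + count
    return True


def end_of_rows(rows=DESCSET):
    """Return the first number past the last described range."""
    start, count, _ = rows[-1]
    return start + count


if __name__ == "__main__":
    print(rows_are_contiguous())       # no gaps between the three rows
    print(hex(end_of_rows()))          # 57344 + 1056768 = 0x110000
    print(allowed(0x01), allowed(0x85), allowed(0xD800))
```

Under these assumptions the three rows cover 0 through 0x10FFFF with
only the surrogate block left out, so both C0 and C1 control characters
are inside the declared character set.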

| That means that it is not possible in HTML (or more importantly, in
| XML) to represent all valid Unicode characters in data fields.

What would you want to use control characters for in an XML document?

--Lars M.

