[ http://nagoya.apache.org/jira/browse/XERCESC-770?page=history ]
Alberto Massari updated XERCESC-770:
------------------------------------
    Priority: Major

> IANA charset names list inefficient; useful?
> --------------------------------------------
>
>          Key: XERCESC-770
>          URL: http://nagoya.apache.org/jira/browse/XERCESC-770
>      Project: Xerces-C++
>         Type: Bug
>   Components: Utilities
>     Versions: 2.1.0
>  Environment: Operating System: All
> Platform: All
>     Reporter: Markus Scherer
>     Assignee: Xerces-C Developers Mailing List
>
> The IANA charset names list is stored inefficiently. It alone takes up
> 200 kB in the Xerces library.
>
> internal/IANAEncodings.hpp contains const XMLCh gEncodingArray[791][128].
> This uses sizeof(XMLCh)*791*128 or about 200000 bytes. Most of the names
> are shorter than 15 or so characters, and only ASCII characters are ever
> used in IANA charset names. The names should therefore be stored as ASCII
> bytes, and only as many per name as necessary.
>
> As a simpler means of making this array smaller, the IANA charset
> registration imposes an upper limit of 40 characters for charset names.
> There are only two registered names that violate this (I think), they
> could be safely omitted. Add space for the NUL. 128 characters per name
> is way overkill.
>
> I also wonder whether this list is useful at all. Xerces only verifies
> that a name exists in the list. It does not verify that it has a
> converter for it (other than failing to open it, which does not use this
> list). It cannot verify that what the XML document claims its charset is
> matches the converter that Xerces is going to open for this name (e.g.,
> mismatches between Shift-JIS etc. among Windows/Unix/mainframe, see W3C
> Japanese profile for XML).
>
> I suggest to add a compile-time option (#ifdef) to remove the IANA
> charset name list (#ifdef out the use of EncodingValidator in
> util/TransService.cpp).
>
> Note that ICU4C 2.2+ has data structures and APIs for dealing with
> charset names associated with various standards (like IANA) and
> platforms. ICU4C does not have a complete list of IANA names, but this is
> a matter of adding them to its convrtrs.txt, not a real implementation
> issue.
>
> Best regards,
> markus
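
The size estimate and the packed-ASCII suggestion in the report can be sketched roughly as follows. This is only an illustration, not Xerces code: XMLCh16, fixedTableBytes, equalsIgnoreAsciiCase and PackedNames are hypothetical names, and XMLCh is assumed to be a 16-bit type (which is what the "about 200000 bytes" figure implies). Lookup ignores ASCII case, since IANA charset names are not case-sensitive.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Stand-in for Xerces' XMLCh. Assumption: a 16-bit code unit.
using XMLCh16 = char16_t;
static_assert(sizeof(XMLCh16) == 2, "sketch assumes 2-byte code units");

// Footprint of a fixed-width table such as gEncodingArray[791][128]:
// every name occupies a full row, however short it is.
constexpr std::size_t fixedTableBytes(std::size_t names,
                                      std::size_t width,
                                      std::size_t unitSize) {
    return unitSize * names * width;
}

// ASCII-only, locale-independent case folding for one character.
inline char asciiUpper(char c) {
    return (c >= 'a' && c <= 'z') ? static_cast<char>(c - 'a' + 'A') : c;
}

// Compare two NUL-terminated ASCII names, ignoring ASCII case.
inline bool equalsIgnoreAsciiCase(const char* a, const char* b) {
    for (; *a != '\0' && *b != '\0'; ++a, ++b) {
        if (asciiUpper(*a) != asciiUpper(*b)) {
            return false;
        }
    }
    return *a == *b;  // true only if both strings ended together
}

// Hypothetical packed table: NUL-terminated ASCII names laid end to end,
// so each name costs strlen(name) + 1 bytes and nothing more.
class PackedNames {
public:
    void add(const char* asciiName) {
        bytes_.append(asciiName);
        bytes_.push_back('\0');
    }

    // Linear scan over the packed names.
    bool contains(const char* asciiName) const {
        const char* p = bytes_.data();
        const char* end = p + bytes_.size();
        while (p < end) {
            if (equalsIgnoreAsciiCase(p, asciiName)) {
                return true;
            }
            p += std::strlen(p) + 1;  // skip this name and its NUL
        }
        return false;
    }

    std::size_t byteSize() const { return bytes_.size(); }

private:
    std::string bytes_;
};
```

With these assumptions, fixedTableBytes(791, 128, sizeof(XMLCh16)) comes to 202496 bytes, which matches the figure in the report. The simpler fix of 40 characters plus a NUL gives 64862 bytes with 16-bit units, or 32431 bytes if the rows are stored as ASCII bytes. The packed layout goes below that, at the price of a linear scan unless an index is added.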