I would like to add some information here without getting into the core of the discussion:

HTML recognizes far fewer "whitespace" characters than Java or Unicode. Different specifications, and different people, work with different sets of "whitespace" characters.
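To make the "different sets" point concrete, here is a minimal sketch in Python (not one of the environments discussed above, just a convenient one to test in). It compares two definitions that ship with Python: the ASCII-only set in string.whitespace and the Unicode-based str.isspace(). The helper names in_ascii_set and in_python_unicode_set are made up for this illustration.

```python
import string

# Three candidate characters: ASCII space, NO-BREAK SPACE, and U+FEFF.
candidates = {
    "SPACE": "\u0020",
    "NO-BREAK SPACE": "\u00a0",
    "ZERO WIDTH NO-BREAK SPACE": "\ufeff",
}

def in_ascii_set(ch):
    # string.whitespace is the ASCII-only set: space, \t, \n, \r, \x0b, \x0c.
    return ch in string.whitespace

def in_python_unicode_set(ch):
    # str.isspace() applies Python's own Unicode-based definition.
    return ch.isspace()

for name, ch in candidates.items():
    print(f"U+{ord(ch):04X} {name}: "
          f"ascii={in_ascii_set(ch)} unicode={in_python_unicode_set(ch)}")
```

The two definitions agree on U+0020 and disagree on U+00A0, and neither one treats U+FEFF as whitespace.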

Unicode's White_Space property (PropList.txt) contains 24 code points (Unicode 3.2) but not U+FEFF.

U+FEFF ZWNBSP is a format control (Cf), not any kind of space in the usual sense.

U+FEFF, like most Cf characters, is a Default_Ignorable_Code_Point (DerivedCoreProperties.txt). (That is, sorting, searching, matching, etc. usually ignore such code points unless a process explicitly supports them.)
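As a rough illustration of the last two points, the following Python sketch looks up the general category of U+FEFF and then compares strings while ignoring format controls. Python's unicodedata module does not expose Default_Ignorable_Code_Point, so filtering on category Cf is only an approximation I am assuming for this example; the function names strip_format_controls and loosely_equal are hypothetical.

```python
import unicodedata

ZWNBSP = "\ufeff"

# General category in Python's current Unicode database and in the
# Unicode 3.2 snapshot that the standard library also ships.
print(unicodedata.category(ZWNBSP))
print(unicodedata.ucd_3_2_0.category(ZWNBSP))

def strip_format_controls(text):
    """Drop every character whose general category is Cf.

    Assumption: Cf stands in for Default_Ignorable_Code_Point here;
    the two sets are not identical.
    """
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")

def loosely_equal(a, b):
    """Compare two strings after removing format controls from both."""
    return strip_format_controls(a) == strip_format_controls(b)

print(loosely_equal("\ufeffabc", "abc"))
```

A real implementation would take the ignorable set from DerivedCoreProperties.txt rather than from the general category.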

RFC 2279 *is* being updated; see http://www.ietf.org/internet-drafts/draft-yergeau-rfc2279bis-03.txt
Version -04 is supposed to become public shortly.

markus

--
Opinions expressed here may not reflect my company's positions unless otherwise noted.

