I think this discussion is bogging down because several different questions are getting mixed together. Here's what I see as the major issues:
1. Does Unicode specify a single correct way of representing white space?

2. If an input file to XeTeX contains currently less common Unicode whitespace code points, such as U+00A0, what should XeTeX do?

3. Should users be encouraged, or even required, to include those code points in input to XeTeX, in order to achieve typesetting goals that in older TeX engines were achieved by other means?

4. Since many editing environments make it inconvenient to process currently less common Unicode whitespace code points, what should users do if the answer to #3 is "yes"?

Now, separate from identifying what the questions are, here's what I think are reasonable answers to the questions:

1. No. That is not what Unicode is for. Unicode's goal is to subsume all reasonable pre-existing encodings. Some reasonable pre-existing encodings include a non-breaking space character, so Unicode includes one. That does not mean Unicode says you should actually use it! There are many precedents of Unicode providing multiple ways of representing things, as a result of including characters from other systems, without it being reasonable to demand that all Unicode-compatible systems must support all of them. For instance, most of the U+FFxx range is devoted to different kinds of hacks for handling partial-width characters in Asian-language typesetting; the preferred way to do that nowadays is via OpenType features, but the code points remain in the standard. The U+0000 to U+001F range is basically control characters for Teletype machines; some of those, like U+000A and U+000D, are widely used in modern documents (but in varying ways by different systems!) and others, like U+001D, are virtually unheard-of. Unicode does NOT say everybody has to support them all, let alone all in the same way.
The U+00A0 code point is not explicitly deprecated in Unicode, but it was never a principle of Unicode that all implementations have to support all defined control characters regardless of appropriateness to the particular purpose. "Non-breaking space" is, from TeX's point of view, not really a character at all, but a formatting command; and TeX already has a way of dealing with formatting commands in general and this one in particular. It is appropriate to say that the preferred way of handling non-breaking spaces in TeX input is the existing TeX way; and saying that in NO WAY AT ALL contradicts anything in Unicode. Unicode is servant, not master.

2. Inevitably, people will include invalid characters in TeX input; and U+00A0 is an invalid character for TeX input. The best way to deal with it is to treat it like any other invalid character and generate an error message. A reasonable alternative would be to say "it is whitespace; it will be treated like other whitespace." That would mean ignoring its breaking/non-breaking-ness, as we have for a long time similarly ignored the special properties of U+0009 (tab). Of course, if users want to define a special meaning for U+00A0 in their own input, they can do so with the existing mechanisms for redefining the meanings of input characters; but "U+00A0 is equivalent to U+007E (~)," for instance, should never be the default and (because of trouble displaying it) shouldn't be encouraged.

3. No. Better to keep everything visible and backward compatible. U+007E (~) should remain the preferred way of doing non-breaking space.

4. Not applicable because of the answer to #3. Users who do insist on putting U+00A0 in their input presumably have *already* got their own reasons to think that it's more convenient for them, including solutions satisfactory to themselves for how to type it on keyboards and see it on screens, so that's their business and not a problem we need to solve.
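
For concreteness, here is roughly how the three behaviours discussed under #2 could be expressed with ordinary category codes. This is only a sketch, not a description of what XeTeX or any format currently does by default; it assumes a format in which \nobreak and \leavevmode are defined (plain TeX and LaTeX both provide them), and the particular definition given to the active character in (c) is my own illustration.

```tex
% Sketch only: choose ONE of (a), (b), (c). None of these is claimed
% to be the default in XeTeX or in any format.

% (a) Make U+00A0 an invalid character (category 15), so that TeX
%     complains whenever it appears in the input.
\catcode"00A0=15

% (b) Make U+00A0 an ordinary space (category 10). Its non-breaking
%     property is then ignored, just as happens with U+0009 (tab).
\catcode"00A0=10

% (c) Opt-in, per document: make U+00A0 an active character
%     (category 13) and give it a meaning like that of ~.
%     ^^a0 is TeX's notation for the character with code "A0.
\catcode"00A0=13
\def^^a0{\leavevmode\nobreak\ }
```

Option (c) is the kind of thing a user can put in a personal preamble; it is exactly the setting I am arguing should not be shipped as a default.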

-- 
Matthew Skala
[email protected]
People before principles.
http://ansuz.sooke.bc.ca/
