On 8/1/2011 7:26 AM, Naena Guru wrote:

This thread wandered off into an argument about whether U+FEFF ZWNBSP or
U+2060 WJ is better supported and which of the two should be used to inhibit
line breaks. However, several other issues in Naena Guru's questions still
bear addressing:

The Unicode character NBH (No Break Here: U0083) is understood as a hidden character that is used to keep two adjoining visual characters from being separated in operations such as word wrapping.

As Jukka noted, U+0083 is a C1 control code, whose semantics is not actually
defined by the Unicode Standard. Its function in ISO 6429 is to represent the control function "No Break Here". U+0083 is unlikely to be supported (except for pass-through) by any significant Unicode-based software as a control function. Its only implementation was likely for some terminal-based software in what are now
basically obsolete systems.

See the Wikipedia article on C0 and C1 control codes for a quick summary of the
status of the various control codes and their implementation:

http://en.wikipedia.org/wiki/C0_and_C1_control_codes
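
The status of U+0083 can also be checked against the Unicode Character Database. A minimal sketch, assuming Python's standard unicodedata module (the variable names and the "<no name>" placeholder are only illustrative):

```python
import unicodedata

ch = "\u0083"

# General category as recorded in the UCD; "Cc" means control character.
category = unicodedata.category(ch)

# Control codes have no character name in the UCD, so supply a placeholder
# instead of letting unicodedata.name() raise ValueError.
name = unicodedata.name(ch, "<no name>")

print(f"U+{ord(ch):04X}", category, name)
```

The lookup reports only a general category of Cc and no name, which is consistent with the standard leaving the semantics of C1 controls to other specifications such as ISO 6429.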

It seems to be similar to ZWJ (Zero Width nonJoiner: U200C) in that it can prevent automatic formation of a ligature as programmed in a font.

U+200C ZWNJ is the Unicode format control whose function is to break cursive
connection between adjacent characters. That is a different and distinct function
from indicating the position of an inhibited line break.
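
The characters involved here are distinct entries in the Unicode Character Database. A small sketch, again assuming Python's standard unicodedata module (the dictionary name is hypothetical), that lists the three format controls mentioned in this thread:

```python
import unicodedata

# Map each code point to its (name, general category) from the UCD.
format_controls = {
    f"U+{ord(ch):04X}": (unicodedata.name(ch), unicodedata.category(ch))
    for ch in ("\u200c", "\u2060", "\ufeff")
}

for code_point, (name, category) in format_controls.items():
    print(code_point, category, name)
```

All three are format controls (general category Cf), but U+200C concerns cursive joining, while U+2060 and U+FEFF concern line breaking.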

Also, it is important to recognize that the insertion of *any* random control code between two characters may end up preventing automatic formation of a font ligature, if it isn't accounted for in the font tables. That does not imply that insertion of random control codes (including U+0083) is a recommended way of inhibiting ligature formation for
a pair of characters in a particular font.

However, it seems to me that an NBH evokes a question mark (?) Is this an oversight by implementers or am I making wrong assumptions?

Because most control codes, including nearly all of the C1 control codes, are unsupported by typical Unicode-based text processing software, it is not too surprising that insertion of U+0083 in text would result in a "?" or other indication of an unsupported and/or undisplayable
character.

There is also the NBSP (No-break Space: U00A0), which I think has to be mapped to the space character in fonts, that glues two letters together by a space. If you do not want a space between two letters and also want to prevent glyph substitutions to happen, then NBH seems to be the correct character to use.

No. And that leads to the discussion which followed, about U+FEFF and U+2060.

NBH is more appropriate for use within ISO-8859-1 characters than ZWNJ, because the latter is double-byte.

"Double-byte" is not a concept with any applicability to the Unicode Standard. That is a hold-over from Asian character sets which mixed ASCII with two-byte encoding of extensions to
cover Han characters (and other additions).
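
To illustrate why "double-byte" does not describe a Unicode character: the number of bytes depends on the encoding form, not on the character. A minimal sketch in Python (the helper name encoded_lengths is just for illustration):

```python
def encoded_lengths(ch):
    """Return the byte length of one character in each Unicode encoding form."""
    return {enc: len(ch.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}

nbh = encoded_lengths("\u0083")   # C1 control "No Break Here"
zwnj = encoded_lengths("\u200c")  # ZERO WIDTH NON-JOINER

print("U+0083:", nbh)
print("U+200C:", zwnj)
```

In UTF-8, U+0083 takes two bytes and U+200C takes three; in UTF-16 both take two bytes; in UTF-32 both take four. U+0083 is a single byte only in an 8-bit encoding such as ISO 8859-1.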

And U+0083 is no more appropriate for use with ISO 8859-1 implementations than with Unicode implementations, for the same reason: it is a control function which simply isn't supported.

Programs that handle SBCS well ought to be afforded the use of NBH as it is a SBCS character. Or, am I completely mistaken here?

If you actually run into the byte 0x83 in data which is ostensibly labeled "ISO-8859-1", in almost all actual cases you are instead dealing with mislabeled Windows Code Page 1252 data, in which 0x83 is U+0192 LATIN SMALL LETTER F WITH HOOK. It would be really inadvisable to start expecting it to be supported as a line break inhibiting control code instead.
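
The two readings of the byte can be compared directly. A minimal sketch using Python's built-in "latin-1" and "cp1252" codecs (the variable names are only illustrative):

```python
import unicodedata

raw = b"\x83"

# Read strictly as ISO 8859-1, the byte maps to the C1 control U+0083.
as_latin1 = raw.decode("latin-1")

# Read as Windows Code Page 1252, the same byte is a printable letter.
as_cp1252 = raw.decode("cp1252")

print(f"latin-1: U+{ord(as_latin1):04X}")
print(f"cp1252:  U+{ord(as_cp1252):04X} {unicodedata.name(as_cp1252)}")
```

This is why a stray 0x83 in supposedly ISO 8859-1 text is far more likely to be an f with hook than an intended "No Break Here".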

--Ken

