On 8/1/2011 7:26 AM, Naena Guru wrote:
This thread wandered off into an argument about whether U+FEFF ZWNBSP or U+2060 WJ is better supported and which should be used to inhibit line breaks. However, there are still several other issues which bear addressing in Naena Guru's questions:
> The Unicode character NBH (No Break Here: U+0083) is understood as a
> hidden character that is used to keep two adjoining visual characters
> from being separated in operations such as word wrapping.
As Jukka noted, U+0083 is a C1 control code, whose semantics are not actually defined by the Unicode Standard. Its function in ISO 6429 is to represent the control function No Break Here. U+0083 is unlikely to be supported (except for pass-through) by any significant Unicode-based software as a control function. Its only implementation was likely in some terminal-based software on what are now basically obsolete systems.
See the Wikipedia article on C0 and C1 control codes for a quick summary of the status of various control codes and their implementations:

http://en.wikipedia.org/wiki/C0_and_C1_control_codes
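That status is easy to confirm from the Unicode Character Database itself; as a sketch (my example, not part of the original message), Python's `unicodedata` module reports U+0083 as an unnamed control character of general category Cc:

```python
import unicodedata

# U+0083 is a C1 control code: general category "Cc" (Other, Control)
print(unicodedata.category("\u0083"))   # prints "Cc"

# Control codes carry no character name in the UCD, so name() fails
try:
    unicodedata.name("\u0083")
except ValueError:
    print("U+0083 has no character name in the UCD")
```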
> It seems to be similar to ZWNJ (Zero Width Non-Joiner: U+200C) in that it
> can prevent automatic formation of a ligature as programmed in a font.
U+200C ZWNJ is the Unicode format control whose function is to break cursive connection between adjacent characters. That is a different and distinct function from indicating the position of an inhibited line break.

Also, it is important to recognize that the insertion of *any* random control code between two characters may end up preventing automatic formation of a font ligature, if it isn't accounted for in the font tables. That does not imply that insertion of random control codes (including U+0083) is a recommended way of inhibiting ligature formation for a pair of characters in a particular font.
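To make the distinction concrete, here is a small sketch (mine, not from the original post) using Python's `unicodedata`: both ZWNJ and WORD JOINER are format controls (category Cf), but their defined functions differ, as their names suggest:

```python
import unicodedata

# Two invisible format controls with different defined functions:
# ZWNJ breaks cursive joining/ligation; WJ inhibits a line break.
for cp, function in [
    ("\u200C", "breaks cursive joining / ligature formation"),
    ("\u2060", "inhibits a line break opportunity"),
]:
    print(f"U+{ord(cp):04X} {unicodedata.name(cp)} "
          f"[{unicodedata.category(cp)}]: {function}")
```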
> However, it seems to me that an NBH evokes a question mark (?). Is this
> an oversight by implementers or am I making wrong assumptions?
Because most control codes, including nearly all of the C1 control codes, are unsupported by typical Unicode-based text processing software, it is not too surprising that insertion of U+0083 in text would result in a ? or some other indication of an unsupported and/or undisplayable character.
> There is also the NBSP (No-Break Space: U+00A0), which I think has to
> be mapped to the space character in fonts, that glues two letters
> together by a space. If you do not want a space between two letters
> and also want to prevent glyph substitutions from happening, then NBH
> seems to be the correct character to use.
No. And that leads to the discussion which followed, about U+FEFF and U+2060.
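As a sketch of the difference (my example, not from the message): U+00A0 is a visible space character that simply offers no break opportunity, while U+2060 is an invisible format control whose only job is to inhibit a break between its neighbors. The categories in the UCD reflect this:

```python
import unicodedata

# NO-BREAK SPACE: a visible space (category Zs) that glues its
# neighbors together across line breaking
print(unicodedata.name("\u00A0"), unicodedata.category("\u00A0"))

# WORD JOINER: an invisible format control (category Cf) that
# inhibits a line break without introducing any visible space
print(unicodedata.name("\u2060"), unicodedata.category("\u2060"))

# "A" and "B" with no space and no break opportunity between them:
no_break = "A\u2060B"
```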
> NBH is more appropriate for use within ISO-8859-1 characters than
> ZWNJ, because the latter is double-byte.
Double-byte is not a concept with any applicability to the Unicode Standard. That is a holdover from Asian character sets which mixed ASCII with two-byte encodings of extensions to cover Han characters (and other additions).

And U+0083 is no more appropriate for use with ISO 8859-1 implementations than with Unicode implementations, for the same reason: it is a control function which simply isn't supported.
> Programs that handle SBCS well ought to be afforded the use of NBH as
> it is a SBCS character. Or, am I completely mistaken here?
If you actually run into the byte 0x83 in data which is ostensibly labeled ISO 8859-1, in almost all actual cases you would be dealing instead with 0x83 (= U+0192 LATIN SMALL LETTER F WITH HOOK) in mislabeled Windows Code Page 1252 data. It would be really inadvisable to start expecting it to be supported as a line-break-inhibiting control code instead.
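The mislabeling is easy to demonstrate; in a sketch using Python's built-in codecs (my example), the same byte 0x83 decodes to the C1 control under ISO 8859-1 but to ƒ under Windows code page 1252:

```python
raw = b"\x83"

# Interpreted as ISO 8859-1 (latin-1), 0x83 maps straight through
# to the C1 control U+0083
print(f"latin-1: U+{ord(raw.decode('latin-1')):04X}")   # U+0083

# Interpreted as Windows code page 1252, the same byte is the letter
# U+0192 LATIN SMALL LETTER F WITH HOOK
print(f"cp1252:  U+{ord(raw.decode('cp1252')):04X}")    # U+0192
```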
--Ken