Re: Gaps in Brahmic scripts section of SMP

2011-08-08 Thread Ken Whistler

On 8/2/2011 3:26 PM, stas624-...@yahoo.com wrote:

[Mainly aimed at people who can change roadmaps]
[I used the online feedback form, but got no response, so I am reposting it here.]


Your feedback was forwarded to the roadmap committee, which will consider
it in the context of other requests and suggestions regarding new and
existing allocations. You can always discuss such issues here on the
unicode general discussion list, of course, but as Shriramana Sharma
noted, the online feedback form is the way to get official notice of
input such as this.



There are 6 small gaps in the 'Brahmic scripts' section of the SMP,
which I think are quite useless. They can easily be united to form 3
larger gaps, into which new scripts can fit.


These will be addressed. Some might be easy to fix. Others not.



Finally, swap Takri+Jenticha with Satavahana+2 gaps to get yet another 4 gaps.


Takri cannot be moved. It has already passed its final technical ballots.



If you think it is premature because block sizes can change, I'd propose
at least moving Tirhuta, as it will be too late once it is allocated.

I think those small gaps will hardly ever be filled with extensions to already 
encoded Brahmic scripts.


That is probably true. But keep in mind that all of the preliminary
allocations (the ones in red and blue on the roadmap) are subject to
further updating before anything is finally approved. It is quite common
for proposed allocations for historic Brahmi-derived scripts to gain (or
lose) columns as the proposals become more mature and get closer to
approval. So it can be a mistake to try to engage in too much
fine-tuning of the roadmap too early in the process.

Also, it isn't that great a concern that a few columns here and there
on the SMP may end up more or less permanently unallocated, as long as
the overall allocation is reasonably compact.

--Ken





Re: How is NBH (U0083) Implemented?

2011-08-08 Thread Ken Whistler

On 8/1/2011 7:26 AM, Naena Guru wrote:

This thread wandered off into an argument about whether U+FEFF ZWNBSP
or U+2060 WJ is better supported and which should be used to inhibit
line breaks. However, there are still several other issues which bear
addressing in Naena Guru's questions:

The Unicode character NBH (No Break Here: U+0083) is understood as a
hidden character that is used to keep two adjoining visual characters
from being separated in operations such as word wrapping.


As Jukka noted, U+0083 is a C1 control code whose semantics are not
actually defined by the Unicode Standard. Its function in ISO 6429 is to
represent the control function No Break Here. U+0083 is unlikely to be
supported (except for pass-through) by any significant Unicode-based
software as a control function. Its only implementation was likely in
some terminal-based software on what are now basically obsolete systems.

See the Wikipedia article on C0 and C1 control codes for a quick summary
of the status of various control codes and their implementation:

http://en.wikipedia.org/wiki/C0_and_C1_control_codes
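The unassigned status of U+0083 shows up directly in the Unicode Character Database; a minimal check with Python's standard unicodedata module (an illustrative sketch, not part of the original mail) confirms it is a bare Cc control with no character name:

```python
import unicodedata

# U+0083 carries the general category "Cc" (Other, control); the Unicode
# Standard assigns it no semantics of its own.
assert unicodedata.category('\u0083') == 'Cc'

# Control codes have no Unicode character name, so name() raises ValueError.
try:
    unicodedata.name('\u0083')
except ValueError:
    print('U+0083 has no character name')
```

Any software looking up this character finds nothing beyond "control code," which is consistent with pass-through being the only realistic behavior.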

It seems to be similar to ZWNJ (Zero Width Non-Joiner: U+200C) in that
it can prevent automatic formation of a ligature as programmed in a font.


U+200C ZWNJ is the Unicode format control whose function is to break
the cursive connection between adjacent characters. That is a different
and distinct function from indicating the position of an inhibited line
break.
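The character in question and its joining counterpart are easy to tell apart by their Unicode properties; a quick sketch with Python's unicodedata module (illustrative, not from the original mail):

```python
import unicodedata

zwnj = '\u200c'  # ZERO WIDTH NON-JOINER: inhibits cursive joining/ligation
zwj  = '\u200d'  # ZERO WIDTH JOINER: requests cursive joining/ligation

print(unicodedata.name(zwnj))  # ZERO WIDTH NON-JOINER
print(unicodedata.name(zwj))   # ZERO WIDTH JOINER

# Both are format controls (general category Cf), i.e. they shape rendering;
# neither is defined as a line-break-inhibiting control.
assert unicodedata.category(zwnj) == 'Cf'
assert unicodedata.category(zwj) == 'Cf'
```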

Also, it is important to recognize that the insertion of *any* random
control code between two characters may end up preventing automatic
formation of a font ligature, if it isn't accounted for in the font
tables. That does not imply that insertion of random control codes
(including U+0083) is a recommended way of inhibiting ligature formation
for a pair of characters in a particular font.

However, it seems to me that an NBH evokes a question mark (?). Is this
an oversight by implementers, or am I making wrong assumptions?


Because most control codes, including nearly all of the C1 control
codes, are unsupported by typical Unicode-based text processing
software, it is not too surprising that insertion of U+0083 in text
would result in a ? or some other indication of an unsupported and/or
undisplayable character.


There is also the NBSP (No-Break Space: U+00A0), which I think has to
be mapped to the space character in fonts, and which glues two letters
together with a space. If you do not want a space between two letters
and also want to prevent glyph substitutions from happening, then NBH
seems to be the correct character to use.


No. And that leads to the discussion which followed, about U+FEFF and 
U+2060.
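For reference, the two characters that discussion centered on can be distinguished with Python's unicodedata module (an illustrative check, not part of the original mail). U+2060 WORD JOINER was added precisely so that U+FEFF could be reserved for its byte-order-mark role:

```python
import unicodedata

# U+2060 WORD JOINER took over the break-inhibiting function of U+FEFF,
# freeing the latter for its byte-order-mark role.
assert unicodedata.name('\u2060') == 'WORD JOINER'
assert unicodedata.name('\ufeff') == 'ZERO WIDTH NO-BREAK SPACE'

# Both are zero-width format controls (general category Cf).
assert unicodedata.category('\u2060') == 'Cf'
assert unicodedata.category('\ufeff') == 'Cf'
```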


NBH is more appropriate for use within ISO-8859-1 characters than 
ZWNJ, because the latter is double-byte. 


Double-byte is not a concept with any applicability to the Unicode
Standard. That is a hold-over from Asian character sets which mixed
ASCII with two-byte encodings of extensions to cover Han characters
(and other additions).

And U+0083 is no more appropriate for use with ISO 8859-1
implementations than with Unicode implementations, for the same reason:
it is a control function which simply isn't supported.


Programs that handle SBCS well ought to be afforded the use of NBH, as
it is an SBCS character. Or am I completely mistaken here?


If you actually run into the byte 0x83 in data which is ostensibly
labeled ISO-8859-1, in almost all actual cases you would instead be
dealing with 0x83 (= U+0192 LATIN SMALL LETTER F WITH HOOK) in
mislabeled Windows Code Page 1252 data. It would be really inadvisable
to start expecting it to be supported as a line-break-inhibiting control
code instead.
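That mislabeling is easy to demonstrate; a minimal Python sketch (not part of the original mail) shows how differently the same byte decodes under the two commonly confused encodings:

```python
raw = b'\x83'

# ISO 8859-1 maps every byte straight to the same code point:
# 0x83 becomes the C1 control U+0083.
as_latin1 = raw.decode('latin-1')

# Windows-1252 reuses the C1 range for printable characters:
# 0x83 becomes U+0192 LATIN SMALL LETTER F WITH HOOK.
as_cp1252 = raw.decode('cp1252')

assert as_latin1 == '\u0083'
assert as_cp1252 == '\u0192'
print(as_cp1252)  # ƒ
```

So data "labeled ISO-8859-1" that contains 0x83 is almost certainly CP1252 text carrying an ƒ, not a No Break Here control.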


--Ken