At 10:55 PM 9/20/2004, Doug Ewell wrote:
Jörg Knappen <knappen at uni dash mainz dot de> wrote:

> I see a precedent in Unicode to treat copyright-like signs differently
> from simple encircled letters:
>
> Unicode takes precautions not to encode the same character twice.
> Therefore, superscript digits 2 and 3 are absent from the superscript
> block U+2070 ff.
>
> However, the Enclosed Alphanumerics block U+2460 ff includes encircled
> capital Latin letters C, P, and R in addition to the copyright-like
> signs elsewhere.
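The code point claims above can be checked against the character database that ships with Python. This is only a sketch: the helper name `assigned_names` is made up for illustration, and the result reflects whatever Unicode version the local `unicodedata` module carries. Note that SUPERSCRIPT ONE (U+00B9) also lives in the Latin-1 range, alongside the two digits mentioned.

```python
import unicodedata

def assigned_names(first, last):
    """Map each code point in [first, last] to its name, or None if unassigned."""
    return {cp: unicodedata.name(chr(cp), None) for cp in range(first, last + 1)}

# Start of the superscripts block: look for gaps where digits would be.
superscripts = assigned_names(0x2070, 0x2079)
for cp, name in superscripts.items():
    print(f"U+{cp:04X}  {name or '<unassigned>'}")

# The superscript digits inherited from Latin-1 are encoded here instead.
for cp in (0x00B9, 0x00B2, 0x00B3):
    print(f"U+{cp:04X}  {unicodedata.name(chr(cp))}")

# Circled capital C, P and R in the enclosed alphanumerics block.
circled = assigned_names(0x24B8, 0x24C7)
for cp in (0x24B8, 0x24C5, 0x24C7):
    print(f"U+{cp:04X}  {circled[cp]}")
```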

OK, I guess I need some guidance from the Unicode elder statesmen and
greater experts.

Ken and I have both given quite a bit of guidance on this issue. It would have been nice if you had acknowledged the pertinent aspects of our postings - so that we don't need to repeat them.

I have been under the impression all along that what Jörg calls
"copyright-like signs," meaning U+00A9 and U+00AE and U+2117 and
possibly others, are encoded as separate entities primarily because
they were in pre-existing legacy character sets.  Remember that a major
goal of Unicode at its inception was to make sure all such character
sets were covered.

Obviously U+00A9 and U+00AE were in ISO 8859-1, at those same code
points.  They also appeared in MS-DOS code page 850, which also predated
Unicode.  I don't know if U+2117 was in any existing standards; I just
know it's in my Unicode 1.0 book.
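The legacy coverage described above can be probed with a short sketch. It assumes that Python's `latin-1` and `cp850` codecs are faithful stand-ins for ISO 8859-1 and MS-DOS code page 850; the helper `single_byte` is hypothetical, written only for this illustration.

```python
def single_byte(ch, codec):
    """Return the byte value of ch in a single-byte codec, or None if absent."""
    try:
        encoded = ch.encode(codec)
    except UnicodeEncodeError:
        return None
    return encoded[0] if len(encoded) == 1 else None

SIGNS = {
    "COPYRIGHT SIGN": "\u00a9",
    "REGISTERED SIGN": "\u00ae",
    "SOUND RECORDING COPYRIGHT": "\u2117",
}

for name, ch in SIGNS.items():
    for codec in ("latin-1", "cp850"):
        byte = single_byte(ch, codec)
        where = "not present" if byte is None else f"byte 0x{byte:02X}"
        print(f"U+{ord(ch):04X} {name:<26} {codec:<8} {where}")
```

A `None` result only says that these two codecs lack the character; it says nothing about other pre-Unicode standards.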

There's a difference between immediate rationale and long-term policy. The initial collection (except for 2117) was indeed dictated by the desire for compatibility. 2117 was added explicitly as an analogue. Although there may be a citation for it in some character set, I don't recall that such a citation was used as the primary argument at the time.

Jörg's comments imply that these symbols are in Unicode because of a
policy or "precedent" for treating such symbols specially, not (or not
only) because of the policy of encoding whatever was in the legacy
character sets of the time.

Ken gave a very good answer on why sometimes a combination can't be used for a symbol. I won't repeat his arguments here. I just want to note that we are still relatively unconstrained in setting the long-term policy here. There's tension between those who see the circle used generically, and would like to see combining sequences as the standard solution, and those who see the combination take on different properties so that in the end, separate encoding is warranted (for at least a subset of such symbols).

Let's suppose we were back in the mid-'90s, and for whatever reason, the
circled Latin letters in the U+24xx block were already encoded but the
three "copyright-like signs" were not.  Suppose they weren't in any
legacy character sets either.  (Use your imagination.)

Now suppose someone proposed that the circled-C copyright symbol
(picking the most widely used example) be encoded as a separate entity.
Suppose further that someone else pointed out that it could be
represented by one of the circled Latin letters in the U+24xx zone (Ⓒ or
ⓒ), and a debate ensued over whether those letters were of the correct
size.

Sizing of symbols is very tricky. There are clearly instances where size is directly linked to semantics (see n-ary operators, for example). There are other instances, such as (R), where the size and position of the symbol vary enormously in text. Then there are symbols where the absolute size in final form is prescribed by context. The ESTIMATED symbol must be in a specific size depending on the value of the *weight* associated with it.

At some point along this trajectory you leave plain text behind, whether
you want to or not.

Finally, let's suppose that someone else suggested using the combination
U+0043 (or U+0063) plus U+20DD, the combining enclosing circle, and that
we then had a debate over whether fonts and rendering engines were up to
the task.

What would UTC and WG2 do?  Would they choose to encode COPYRIGHT SIGN
on its own, recommend the existing circled Latin letters, or recommend
the combining sequence?  Why?  (Use a separate sheet of paper if
necessary.)

In the scenario you describe, you can assume that users would have used the existing code at 24B8 since that's what would have been immediately available. Given the existence of that kind of legacy, it's much less likely that any alternative would garner official support. Legacy, if widespread, is a strong argument.

However - it does not apply to our case, because the combining sequence
is not currently (and may never be) a good solution.
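The three candidate representations in this thought experiment (the dedicated sign, the circled letter, and the combining sequence) can be compared in terms of normalization. The sketch below uses only Python's standard `unicodedata` module; the dictionary labels and the helper `describe` are made up for illustration, and the sketch says nothing about how fonts actually render the sequence.

```python
import unicodedata

CANDIDATES = {
    "dedicated sign": "\u00a9",          # COPYRIGHT SIGN
    "circled letter": "\u24b8",          # CIRCLED LATIN CAPITAL LETTER C
    "combining sequence": "C\u20dd",     # C + COMBINING ENCLOSING CIRCLE
}

def describe(text):
    """Collect character names and normalized forms for a candidate string."""
    return {
        "names": [unicodedata.name(ch) for ch in text],
        "nfc": unicodedata.normalize("NFC", text),
        "nfkc": unicodedata.normalize("NFKC", text),
    }

for label, text in CANDIDATES.items():
    info = describe(text)
    nfkc_points = " ".join(f"U+{ord(ch):04X}" for ch in info["nfkc"])
    print(f"{label:<20} {info['names']}")
    print(f"{'':<20} NFC unchanged: {info['nfc'] == text}, NFKC: {nfkc_points}")
```

None of the three forms normalizes to another: the circled letter folds to a plain C under NFKC, while the copyright sign and the combining sequence are each left as they are. Software therefore treats them as three unrelated strings.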

On Tue, 21 Sep 2004 14:03:04  "António Martins-Tuválkin" wrote:

The Standard notes for U+2139 that it is "intended for use with 20DD".

I regard this as one of the 'still-born' parts of Unicode. I think there are complexities in the use of enclosing marks that people did not realize at the time. I see a certain amount of caution in the UTC nowadays when it comes to using 20DD.

Note 26A0, which is *not* <0021, 20E4>
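That 26A0 is a character in its own right, with no canonical relationship to the sequence <0021, 20E4>, can be confirmed with a small sketch. The helper `canonically_equivalent` is a hypothetical name; it simply compares the NFD forms produced by Python's `unicodedata` module.

```python
import unicodedata

def canonically_equivalent(a, b):
    """True if two strings are identical after canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

WARNING_SIGN = "\u26a0"
EXCLAMATION_IN_TRIANGLE = "!\u20e4"

print(unicodedata.name(WARNING_SIGN))
print([unicodedata.name(ch) for ch in EXCLAMATION_IN_TRIANGLE])
# An empty decomposition field means the character maps to nothing else.
print(repr(unicodedata.decomposition(WARNING_SIGN)))
print(canonically_equivalent(WARNING_SIGN, EXCLAMATION_IN_TRIANGLE))
```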

A./




