On 10/20/2013 1:47 AM, Jukka K. Korpela wrote:
2013-10-20 2:38, Richard Wordingham wrote:
Is a sequence of a U+25CC DOTTED CIRCLE plus a combining mark plain
text?
Well, is <h1>hello<h1> plain text? The answer is that any string of
characters may be considered as plain text and any string of
characters may be treated as rich text according to some conventions.
Unless some such conventions have been established, a string of
character codes is plain text.
A random implementation choice is a bug, not a convention.
Just because Unicode does not provide a method to announce or register a
convention doesn't mean all behavior should reverently be treated as a
convention.
If so, how many dotted circles should appear?
Possibly none. An implementation need not support any particular
collection of characters. But an implementation that supports U+25CC
must treat it as a spacing character, and an implementation that
supports e.g. U+0300 must treat it as a combining mark. So if the
implementation is capable of visually rendering them, it shall render
U+25CC U+0300 as a dotted circle with a grave accent above it. In
this case, then, exactly one dotted circle should appear.
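The claim that U+25CC is a spacing character and U+0300 a combining mark can be checked against the Unicode Character Database. A minimal sketch using Python's standard unicodedata module follows; the helper name describe is made up for illustration, and the values returned depend on the Unicode version bundled with the interpreter.

```python
import unicodedata

def describe(ch):
    """Return (name, general category, canonical combining class) for one character.

    'describe' is a hypothetical helper name, not part of any library.
    """
    return (unicodedata.name(ch), unicodedata.category(ch), unicodedata.combining(ch))

# U+25CC is a symbol (category So) with combining class 0, i.e. a spacing base.
# U+0300 is a nonspacing mark (category Mn) that attaches to the preceding base.
for ch in "\u25CC\u0300":
    print(f"U+{ord(ch):04X}", describe(ch))
```

A renderer that honors these properties has a single base in U+25CC U+0300, so a single dotted circle is expected.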
Implementations often have bugs in dealing with combining marks. This
may depend on the rendering software, or on the font.
And bugs are bugs and not conventions.
If the sequence is not plain text, what mark-up notations are
available to control the number of dotted circles produced? I
am particularly interested in notation for HTML, e.g. via a style
sheet. Should the sequence instead be treated as a graphic?
I don’t understand these questions. If the sequence is treated as
other than plain text, then the results depend on the specific “rich
text” or other conventions applied.
A typical convention is the "show special characters" feature in many
editors. Such a feature could make combining marks visible by forcing
them to appear as isolated marks (not part of a sequence), each over a
dotted circle.
There are some conventions that show an extra dotted circle for certain
ill-formed sequences involving combining marks. Script-specific
combining marks may indeed have contexts in which they make no possible
sense. General purpose combining marks are not so restricted and to show
dotted circles with them is a bug.
Incidentally, the dotted circle shown in the Unicode Code charts is
*not* 25CC, and if I were to implement a "show dotted circle" feature in
a program I would not use 25CC for this - that character has a standard
glyph of rather unsuitable metrics for the purpose, never mind that many
people have co-opted it.
This question is prompted by a confused discussion of what the notation
<U+25CC, U+0E31 THAI CHARACTER MAI HAN-AKAT, U+25CC> on a web page
meant.
What it means is a different issue. U+25CC is a symbol that can be
used in a variety of meanings. I don’t think it means anything
specific to most people, unless a definition is given. U+0E31 is a
Thai vowel sign, and I don’t think any meaning in general has been
assigned to it when applied to something other than a Thai letter.
The rendering of the sequence is a different matter. Not surprisingly,
tests on IE 10 show varying results. Using my test page
http://www.cs.tut.fi/~jkorpela/listfonts1.html
that renders, on IE, a given string in all the fonts available in the
system, I noticed that on my system, only SunExt-A and Unifont result
in correct rendering. Using Arial Unicode MS, the rendering is correct
except for the circles being dashed, and I think this is incorrect for
U+25CC, as it violates the identity of the character as a dotted circle.
Not really - if you go back to the originals, e.g. early versions of
Unicode, you see dashed circles. Unicode 2.0 clearly shows a dashed
circle, and I believe that edition dates from around the first use of
outline fonts for the code charts.
A few other fonts contain the characters too, but the renderings have
three similar dotted rings, with the Thai diacritic above the middle
one or (in FreeSerif and Quivira) between the 2nd and 3rd. – On
Chrome, Safari, and Firefox, the results are similar, except that
Chrome shows the string as broken even when Arial Unicode MS is declared.
The confusion was caused because some of us saw two dashed
circles and others saw three dashed circles (one for each character)
when viewing the web page.
The implementations that show three dotted circles are non-conforming.
Showing three dashed circles would be even more non-conforming.
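To see why two circles rather than three are expected for <U+25CC, U+0E31, U+25CC>, one can group each base with the marks that follow it. The sketch below is a deliberate simplification, not the full grapheme cluster algorithm of UAX #29: it merely attaches every character whose general category starts with "M" to the preceding character. The function name simple_clusters is hypothetical.

```python
import unicodedata

def simple_clusters(text):
    """Group each character with the combining marks that follow it.

    Simplified assumption: a character whose general category starts with
    'M' joins the preceding cluster. This is not full UAX #29 segmentation.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch   # mark attaches to the previous base
        else:
            clusters.append(ch)  # any other character starts a new cluster
    return clusters

sequence = "\u25CC\u0E31\u25CC"  # dotted circle, MAI HAN-AKAT, dotted circle
clusters = simple_clusters(sequence)
print(len(clusters))  # 2: <U+25CC U+0E31> and <U+25CC>
```

Under this grouping the Thai vowel sign sits on the first dotted circle, and the only other circle is the one encoded explicitly after it.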
If the purpose is to display the combining diacritic the same way as
in the code charts in the standard, i.e. with a dotted symbol
appearing as generically showing the place of a base character, then
I’m afraid the approach does not work in general. It should work, in
the sense that conforming implementations would render it the desired
way if they support the characters in rendering, but web browsers just
don’t conform.
Except that there is no character in the standard that matches (by
identity) the dotted glyph used in the code charts.
A./
What you could do in a web page is to put U+00A0 U+25CC in one element
and U+0E31 in another and position the elements in the same place, set
to have the same width and to be horizontally centered. But I’m afraid
this would be off-topic here and could involve some nasty details.
Yucca