Re: Encoding italic

2019-02-12 Thread Kent Karlsson via Unicode


Oh, the crystal ball is pure solid state, no moving or hot parts.
A magic 8-ball on the other hand can easily get jammed...

(Now, enough of that...)

/K


On 2019-02-12 02:57, "James Kass via Unicode" wrote:

> 
> On 2019-02-11 6:42 PM, Kent Karlsson wrote:
> 
>> Using a VS to get italics, or anything like that approach, will
>> NEVER be a part of Unicode!
> 
> Maybe the crystal ball is jammed.  This can happen, especially on the
> older models which use vacuum tubes.
> 
> Wanting a second opinion, I asked the magic 8 ball:
> "Will VS14 italic be part of Unicode?"
> The answer was:
> "It is decidedly so."
> 





Re: Encoding colour (from Re: Encoding italic)

2019-02-12 Thread Kent Karlsson via Unicode


On 2019-02-12 03:20, "Mark E. Shoulson via Unicode" wrote:

> On 2/11/19 5:46 PM, Kent Karlsson via Unicode wrote:
>> Continuing to look deep into the crystal ball, doing some more
>> hand swirls...
>> 
>> ...
>> 
>> ...
>> 
>> The scheme quoted (far) below (from wjgo_10009), or anything like it,
>> will NEVER be part of Unicode!
> 
> Not in Unicode, but I have to say I'm intrigued by the idea of writing
> HTML with tag characters (not even necessarily "restricted" HTML: the
> whole deal).  This does NOT make it possible to write "italics in plain
> text," since you aren't writing plain text.  But what you can do is
> write rich text (HTML) that Just So Happens to look like plain text when
> rendered with a plain-text-renderer (and maybe there could be
> plain-text-renderers that straddle the line, maybe supporting some
> limited subset of HTML and doing boldface and italics or something. 

And so would ESC/command sequences as such, if properly skipped for display.
If some are interpreted, those would affect the display of other characters.
Just like "HTML in tag characters" would. A show invisibles mode would
display both ESC/command sequences as well as "HTML in tag characters"
characters.

> BUT, this would NOT be a Unicode feature/catastrophe at all.  This would
> be purely the decision of the committee in charge of HTML/XML and
> related standards, to decide to accept Unicode tag characters as if they
> were ASCII for the purposes of writing XML tags/attributes.  It's

I have no say on HTML/CSS, but I would venture to predict that those
who do have a say, would not be keen on that idea. And XML tags in
general need not be in ASCII. And... identifiers in CSS need not
be in pure ASCII either... And attribute values, like filenames
including those that refer to CSS files (CSS is preferably stored
separately from the HTML/XML), certainly need not be pure ASCII.

So, no, I'd say that that idea is completely dead.

/Kent K


> totally nothing to do with Unicode, unless the XML folks want Unicode to
> change some properties on the tag chars or something.  I think it's a...
> fascinating idea, and probably has *disastrous* consequences lurking
> that I haven't tried to think of yet, but it's not a Unicode idea.
> 
> ~mark
> 





Re: Encoding italic

2019-02-11 Thread Kent Karlsson via Unicode


On 2019-02-11 10:55, "wjgo_10...@btinternet.com via Unicode" wrote:

> Doug Ewell wrote:
> 
>> ..., just as next to nobody is using the proposed VS14 mechanism ...
> 
> Well, of course not because use of VS14 in a plain text document to
> record a request for an italic glyph version is not at the present time
> an official part of Unicode.

Looking deeply into the crystal ball, swirling my hands over it...

...

...

Using a VS to get italics, or anything like that approach, will
NEVER be a part of Unicode!

/Kent K





Re: Encoding italic

2019-02-10 Thread Kent Karlsson via Unicode




On 2019-02-10 16:31, "James Kass via Unicode" wrote:

> 
> Philippe Verdy wrote,
> 
>>> ...[one font file having both italic and roman]...

For OpenType fonts, there is a "design axis" called "ital". Value 0 on that
axis would be roman (upright, normally), and value 1 on that axis would be
italic. I don't know to what extent that is available in OpenType fonts in
common use... (Instead of using two separate font files.)

[math chars]
> They were encoded for interoperability and round-tripping because they
> existed in character sets such as STIX. 

They were basically requested "by" STIX, yes. Not sure about the
round-tripping bit.

> They remain Latin letter form
> variants.  If they had been encoded as the variant forms which
> constitute their essential identity it would have broken the character
> vs. glyph encoding model of that era.  Arguing that they must not be
> used other than for scientific purposes

I don't think that particular argument was made, IIUC.

> is just so much semantic
> quibbling in order to justify their encoding.
> 
> Suppose we started using the double struck ASCII variants on this list
> in order to note Unicode character numbers such as 𝕌+𝔽𝔼𝔽𝔽 or
> 𝕌+𝟚𝟘𝟞𝟘?

That particular example would be ok (even though outside of a
conventional math formula). But we were talking about natural
languages in their conventional orthography, using italics/bold.

/Kent K





Re: Encoding italic

2019-02-09 Thread Kent Karlsson via Unicode

On 2019-02-08 21:53, "Doug Ewell via Unicode" wrote:

> I'd like to propose encoding italics and similar display attributes in
> plain text using the following stateful mechanism:

Note that these do NOT nest (no stack...), just state changes for the
relevant PART of the "graphic" (i.e. style) state. So the approach in
that regard is quite different from the approach done in HTML/CSS.

> • Italics on: ESC [3m
> • Italics off: ESC [23m
> • Bold on: ESC [1m
> • Bold off: ESC [22m
> • Underline on: ESC [4m
(implies turning double underline off)

   Underline, double: ESC [21m
(implies turning single underline off)

> • Underline off: ESC [24m
> • Strikethrough on: ESC [9m
> • Strikethrough off: ESC [29m
> • Reverse on: ESC [7m
> • Reverse off: ESC [27m

"Reverse" = "switch background and foreground colours".

This is an (odd) colour thing. If you want to go with (full!) colour
(foreground and background), fine, but the "reverse" is oddball (and
based on what really old terminals were limited to when it comes to colour).

I'd rather include 'ESC [50m' (not variable spacing, i.e. "monospace" font)
and 'ESC [26m' (variable spacing, i.e. "proportional" font). Recall that
this is NOT for terminal emulators but for styling applied to text
outside of terminal emulators. (Terminal emulators already implement
much of this and more; albeit sometimes wrongly). This would be handy
for including (say) programming code or computer commands (or for that
matter, "ASCII art", or more generally "Unicode art") in otherwise
"ordinary" text... (The "ordinary" text preferably set in a proportional
font.)

> • Reset all attributes: ESC [m

(Actually 'ESC [0m', with the 0 default-able.) Handy, agreed, but not 100%
necessary.
These ESC-sequences should not normally be inserted "manually" but by a text
editor program, using the conventional means of "making bold" etc. (ctrl-b,
cmd-b, "bold" in a menu); only "hackers" (in the positive sense) would
actually bother about the command sequences as such.
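
To make the "state changes, no stack" point concrete, here is a minimal
Python sketch of how the sequences discussed above could be interpreted.
The names SGR_EFFECTS, DEFAULT and styled_runs are invented for this
illustration, only single-code sequences of the form ESC [ n m are
recognised, and the attribute model is an assumption, not something taken
from ECMA-48 or from any existing editor.

```python
import re

# Each code changes one part of the style state; nothing is pushed or popped.
# Attribute names are illustrative only.
SGR_EFFECTS = {
    1: ("bold", True),           22: ("bold", False),
    3: ("italic", True),         23: ("italic", False),
    4: ("underline", "single"),  # implies double underline off
    21: ("underline", "double"), # implies single underline off
    24: ("underline", None),
    9: ("strike", True),         29: ("strike", False),
    50: ("monospace", True),     26: ("monospace", False),
}
DEFAULT = {"bold": False, "italic": False, "underline": None,
           "strike": False, "monospace": False}

# ESC [ <digits> m, one code per sequence (an assumption of this sketch).
SGR = re.compile("\x1b\\[([0-9]*)m")

def styled_runs(text):
    """Split text into (run, style_state) pairs."""
    state = dict(DEFAULT)
    runs, pos = [], 0
    for m in SGR.finditer(text):
        if m.start() > pos:
            runs.append((text[pos:m.start()], dict(state)))
        code = int(m.group(1) or 0)      # an empty parameter defaults to 0
        if code == 0:
            state = dict(DEFAULT)        # reset all attributes
        elif code in SGR_EFFECTS:
            attr, value = SGR_EFFECTS[code]
            state[attr] = value          # unknown codes are silently ignored
        pos = m.end()
    if pos < len(text):
        runs.append((text[pos:], dict(state)))
    return runs
```

With the input "a ESC[1m b ESC[3m c ESC[22m d", the run "d" is italic but
not bold: bold was switched off without "closing" italic first, which is
exactly what a nesting model like HTML would not allow.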

/Kent K


> where ESC is U+001B.
>  
> This mechanism has existed for around 40 years and is already supported
> as widely as any new Unicode-only convention will ever be.
>  
> --
> Doug Ewell | Thornton, CO, US | ewellic.org
>  
> 



Re: Encoding italic

2019-02-09 Thread Kent Karlsson via Unicode


On 2019-02-08 22:29, "Egmont Koblinger via Unicode" wrote:

> (Mind you, I don't find it a good idea to add italic and whatnot
> formatting support to Unicode at all... but let's put aside that now.)

I don't think Doug meant to "add it to the Unicode standard", just to
have a summary of "handy esc-sequences (actually command-sequences)
for simple styling of text" picked from long-standing (text level...)
standards.

> There are a lot of problems with these escape sequences, and if you go
> for a potentially new standard, you might not want to carry these
> problems.
> 
> There is not a well-defined framework for escape sequences. In this
> particular case you might say it starts with ESC [ and ends with the
> letter 'm', but how do you know where to end the sequence if that
> letter 'm' just doesn't arrive? Terminal emulators have extremely

There is an overriding "basic (overall) syntax" for esc-seq/
command-sequences that do not include a string argument (like OSC,
APC, ...). IIUC it is (originally as byte sequences, but here as
character sequences):

\u001B[\u0020-\u002F]*[\u0030-\u007E]|
(\u001B'['|\u009B)[\u0030-\u003F]*[\u0020-\u002F]*[\u0040-\u007E]

(no newline or carriage return in there). True, that has no direct
limit, but it would not be unreasonable to set a limit of (say)
max 30 characters. Potential (i.e. starting with ESC) esc-"sequences"
that do not match the overall syntax or are too long can simply be
rendered as is (except for the ESC itself). The esc/command sequences
(that match) but are not interpreted should be ignored in "normal"
(not "show invisibles" mode) display.
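
As a rough illustration of the rule described above, the following Python
sketch skips well-formed sequences for display and renders malformed or
over-long ones as is, minus the ESC. The names ESC_SEQ, MAX_LEN and
strip_for_display are made up here, and the limit of 30 is just the "say"
value from the paragraph above. Note that the control-sequence alternative
is tried first, because Python alternation is ordered and "ESC [" on its
own also matches the shorter escape-sequence form.

```python
import re

ESC_SEQ = re.compile(
    # control sequence: (ESC [ or U+009B), parameters, intermediates, final
    "(?:\x1b\\[|\x9b)[\x30-\x3f]*[\x20-\x2f]*[\x40-\x7e]"
    # other escape sequence: ESC, intermediates, final
    "|\x1b[\x20-\x2f]*[\x30-\x7e]"
)
MAX_LEN = 30  # arbitrary upper bound on the length of a sequence

def strip_for_display(text):
    """Return text as it would be shown in 'normal' (not show-invisibles) mode."""
    out, pos = [], 0
    while pos < len(text):
        ch = text[pos]
        if ch in "\x1b\x9b":
            m = ESC_SEQ.match(text, pos)
            if m and len(m.group()) <= MAX_LEN:
                pos = m.end()      # well-formed: ignored for display
            else:
                pos += 1           # malformed or too long: drop only the ESC/CSI
            continue
        out.append(ch)
        pos += 1
    return "".join(out)
```

This only decides what is hidden; whether a given sequence is also
interpreted (as bold, italic, ...) is a separate step.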

They are unlikely to be "default ignored" by such things as sorting
(and should preferably be filtered out beforehand, if possible). But
if we compare to other rich text editors, the command sequences should
be ignored by (interactive) searching, just like HTML tags are ignored
in interactive searching (the internal representation "skipping" the
HTML tags in one way or another). HTML tags should also (when the text is
known to be HTML) be filtered out before doing such things as sorting.

> complex tables for parsing (and still many of them get plenty of
> things wrong). It's unreasonable for any random small utility
> processing Unicode text to go into this business of recognizing all
> the well-known escape sequences, not even to the extent to know where
> they end. Whatever is designed should be much more easily parseable.
> Should you say "everything from ESC[ to m", you'll cause a whole bunch
> of problems when a different kind of escape sequence gets interpreted
> as Unicode.

The escape/command sequences would not be part of Unicode (standard).

> A parser, by the way, would also have to interpret combined sequences
> like ESC[3;0;1m or alike, for which I don't see a good reason as
> opposed to having separate sequences for each. Also, it should be

Formally covered by the (non-Unicode) standards, but optional (IIUC).

> carefully evaluated what to do with C1 (U+009B) instead of the C0 ESC[
> opening for an escape sequence - here terminal emulators vary. These
> just make everything even more cumbersome.
> 
> ECMA-48 8.3.117 specifies ESC[1m as "bold or increased intensity".

I think one should interpret these in a "modern" way, not looking
too much at what old terminals were limited to. (Colour ("increased
intensity") should be handled completely separately from bold.)

> Should this scheme be extended for colors, too? What to do with the
> legacy 8/16 as well as the 256-color extensions wrt. the color
> palette? Should Unicode go into the business of defining a fixed set
> of colors, or allow to alter the palette colors using the OSC 4 and
> friends escape sequences which are supported by about half of the terminal
> emulators out there?

IF extending to colour, only refer to "true colour" (RGB) command-sequence.
The colour palette versions are for the limitations of (semi-)old terminals.

> For 256-colors and truecolors, there are two or three syntaxes out
> there regarding whether the separator is a colon or a semicolon.

It can only be colon. Using semicolon would interfere with the syntax
for multiple style specifications in one command sequence. (I by mistake
wrote a semicolon there in an earlier post; sorry.)
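
A small Python sketch may show why the separator matters. It splits the
parameter part of a control sequence on ";" into independent style codes
and on ":" into the sub-parameters of one code. The function name
parse_sgr_params is invented for this example, and the RGB values are only
sample data; the exact sub-parameter layout of the true-colour sequence is
not spelled out here and should be checked against the standards.

```python
def parse_sgr_params(body):
    """Parse the parameter part of a control sequence, e.g. '1;38:2:255:0:0'.

    ';' separates independent style codes, ':' separates the sub-parameters
    of a single code. Empty fields become None (meaning "default").
    """
    return [
        [int(field) if field else None for field in part.split(":")]
        for part in body.split(";")
    ]
```

With colons, "1;38:2:255:0:0" yields two codes, [1] and [38, 2, 255, 0, 0].
With semicolons, "1;38;2;255;0;0" yields six separate codes, so a parser
that does not special-case 38 would read the 2 and the 0 as style codes of
their own.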

> Some terminal emulators have made up some new SGR modes, e.g. ESC[4:3m
> for curly underline. What to do with them? Where to draw the line what

(Note colon, not semicolon, as separator.) Possible, partially matching
the capabilities for underlining via CSS (solid, dotted, dashed, wavy,
double). Depends on how much styling options one wants to pick up.

> to add to Unicode and what not to? Will Unicode possibly be a

I don't think anyone wants to make this part of the Unicode standard.
(At the most a Unicode technical note...; from Unicode's point of view.)

[...] 
> What to do with things that Unicode might also want to have, but
> doesn't exist in terminal emulators due to their nature, such as
> switching 

Re: Proposal for BiDi in terminal emulators

2019-02-02 Thread Kent Karlsson via Unicode


On 2019-02-02 16:12, "Richard Wordingham via Unicode" wrote:

> On Sat, 02 Feb 2019 14:01:46 +0100
> Kent Karlsson via Unicode  wrote:
> 
>> Well, I guess you may need to put some (practical) limit to the number
>> of non-spacing marks (like max two above + max one below; overstrikes
>> are an edge case). Otherwise one may need to either increase the line
>> height (bad idea for a terminal emulator I think) or the marks start
>> to visually interfere with text on other lines (even with the hinted
>> limits there may be some interference), also a bad idea for a terminal
>> emulator. So I'm not so sure that non-spacing marks is a piece of
>> cake... (I.e., need to limit them.)
> 
> Doesn't Jerusalem in biblical Hebrew sometime have 3 marks below the
> lamedh?  The depth then is the maximum depth, not the sum of the
> depths. 

Do you want to view/edit such texts on a terminal emulator? (Rather
than a GUI window.)
 
> Tai Lue has 'mai sat 3 lem' - that's three marks above for a
> combination common enough to have a name.  Throw in the repetition mark
> and that's four marks above if you treat the subscript consonant as a
> mark (or code it to comply with the USE's erroneous grammar).

I don't question that as such. But again, do you want to view/edit such
texts on a **terminal emulator**?

It is just that such things are likely to graphically overflow the
"cell" boundaries, unless the cells are disproportionately high (i.e.
double or so line spacing). Doesn't really sound like a terminal
emulator... I do not think terminal emulators should be used for
ALL kinds of text.

/Kent K




Re: Proposal for BiDi in terminal emulators

2019-02-02 Thread Kent Karlsson via Unicode


On 2019-02-02 12:17, "Egmont Koblinger" wrote:

> the font. It's taken from EastAsianWidth (or other means, which we're
> working on: https://gitlab.freedesktop.org/terminal-wg/specifications/issues/9

Yes, that too:
FE0F VARIATION SELECTOR-16 = emoji variation selector

But the issue you refer to only deals with U+FE0F. There is also U+FE0E:
FE0E VARIATION SELECTOR-15 = text variation selector
which can make a character that is "default emoji" (which are wide)
into "text variant", often single-width, for instance:
1F315 FE0E ; text style;  # (6.0) FULL MOON SYMBOL
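
A hedged Python sketch of how a width guess could take both variation
selectors into account. The function name cells_with_vs is hypothetical,
and the rule "VS16 means two cells, VS15 means one cell" is a
simplification assumed for this example (the text variant is only "often"
single-width); real terminal emulators may well decide differently.

```python
import unicodedata

VS15 = "\uFE0E"  # text variation selector
VS16 = "\uFE0F"  # emoji variation selector

def cells_with_vs(base, selector=""):
    """Guess the cell width of one base character plus an optional VS15/VS16."""
    if selector == VS16:
        return 2   # emoji presentation: assumed wide
    if selector == VS15:
        return 1   # text presentation: assumed narrow
    # No selector: fall back to the East Asian Width property.
    return 2 if unicodedata.east_asian_width(base) in ("W", "F") else 1
```

For U+1F315 FULL MOON SYMBOL this gives 2 without a selector and 1 when
followed by U+FE0E.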

---

>> Likewise non-spacing combining characters should
>> be possible to deal reasonably with.
> 
> Most terminal emulators handle non-spacing combining marks, it's a
> piece of cake. (Spacing marks are more problematic.)

Well, I guess you may need to put some (practical) limit to the number
of non-spacing marks (like max two above + max one below; overstrikes
are an edge case). Otherwise one may need to either increase the line
height (bad idea for a terminal emulator I think) or the marks start
to visually interfere with text on other lines (even with the hinted
limits there may be some interference), also a bad idea for a terminal
emulator. So I'm not so sure that non-spacing marks is a piece of cake...
(I.e., need to limit them.)
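
One possible way to apply such a limit, sketched in Python. The function
name limit_marks is invented, "above" and "below" are approximated by the
canonical combining classes 230 and 220, and simply dropping the surplus
marks is an assumption made to keep the sketch short; a real emulator
might prefer some other fallback for display.

```python
import unicodedata

def limit_marks(text, max_above=2, max_below=1):
    """Keep at most max_above/max_below non-spacing marks per base character."""
    out, above, below = [], 0, 0
    for ch in text:
        if unicodedata.category(ch) != "Mn":
            above = below = 0          # new base character: reset the counters
            out.append(ch)
            continue
        ccc = unicodedata.combining(ch)
        if ccc == 230:                 # mark placed above
            above += 1
            if above <= max_above:
                out.append(ch)
        elif ccc == 220:               # mark placed below
            below += 1
            if below <= max_below:
                out.append(ch)
        else:
            out.append(ch)             # overstrikes etc. are passed through
    return "".join(out)
```

This would only be used for what is shown in the cell; the text in the
buffer should of course keep all its marks.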

/Kent K




Re: Proposal for BiDi in terminal emulators

2019-02-01 Thread Kent Karlsson via Unicode


On 2019-02-01 19:57, "Richard Wordingham via Unicode" wrote:

> On Fri, 1 Feb 2019 13:02:45 +0200
> Khaled Hosny via Unicode  wrote:
> 
>> On Thu, Jan 31, 2019 at 11:17:19PM +, Richard Wordingham via
>> Unicode wrote:
>>> On Thu, 31 Jan 2019 12:46:48 +0100
>>> Egmont Koblinger  wrote:
>>> 
>>> No.  How many cells do CJK ideographs occupy?  We've had a strong
>>> hint that a medial BEH should occupy one cell, while an isolated
>>> BEH should occupy two.
>> 
>> Monospaced Arabic fonts (there are not that many of them) are designed
>> so that all forms occupy just one cell (most even including the
>> mandatory lam-alef ligatures), unlike CJK fonts.
>> 
>> I can imagine the terminal restricting itself to monospaced fonts,
>> disabling the "liga" feature just in case, and expecting the font to
>> behave well. Any other magic is likely to fail.
> 
> Of course, strictly speaking, a monospaced font cannot support harakat
> as Egmont has proposed.
> 
> Richard.

(harakat: non-spacing vowel mark in Arabic)

"Monospaced font" is really a concept with modification. Even for
"plain old ASCII" there are two advance widths, not just one: 0 for
control characters (and escape/control sequences, neither of which
should directly consult the font; even such things as OSC sequences,
but the latter are a bad idea to have in any line one might wish to
edit (vi/emacs/...) via a terminal emulator window). But terminals
(read terminal emulators) can deal with mixed single width and double
width characters (which is, IIUC, the motivation for the datafile
EastAsianWidth.txt). Likewise non-spacing combining characters should
be possible to deal reasonably with.

It is a lot more difficult to deal with BiDi in a terminal emulator,
also shaping may be hard to do, as well as reordering (or even
splitting) combining characters. All sorts of problems arise; feeding
the emulator a character (or "short" strings) at a time not allowed
to buffer for display (causing reshaping or movement of already
displayed characters, edit position movement even within a single
line, etc.). Even if solvable for a "GUI" text editor (not via a
terminal), they do not seem to be workable in a terminal (emulator)
setting. Esp. not if one also wants to support multiline editing
(vi/emacs/...) or even single-line editing.

As long as editing is limited to a single line (such as the system
line editor, or an "enhanced functionality" line editor (such as
that used for bash; moving in the history sets the edit position
at EOL)) even variable width ("proportional") fonts should not pose
a major problem. But for multiline editors (à la vi/emacs) it would
not be possible to synch nicely (unless one accepts strange jumps)
the visual edit position and the actual edit position in the edit
buffer: The program would not have access to the advance width data
from the font that the terminal emulator uses, unless one
revolutionises what terminal emulators do... (And I don't see a
case for doing that.) But both a terminal emulator and multiline
editing programs (for terminal emulators) still can have access
to EastAsianWidth data as well as which characters are non-spacing;
those are not font dependent. (There might be some glitches if
the Unicode versions used do not match (the terminal emulator
and the program being run are most often on different systems),
but only for characters where these properties have changed,
e.g. newly allocated non-spacing marks.)
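
A minimal Python sketch of such a font-independent width computation,
using only character properties. The function name cell_width is made up,
and treating every Mn/Me/Cc/Cf character as zero-width and every W/F
character as double-width is a simplifying assumption. The results depend
on the Unicode version of the unicodedata module in use, which is the same
version-mismatch caveat as mentioned above.

```python
import unicodedata

def cell_width(text):
    """Estimate the number of terminal cells a string occupies."""
    width = 0
    for ch in text:
        if unicodedata.category(ch) in ("Mn", "Me", "Cc", "Cf"):
            continue                   # non-spacing marks and controls: 0 cells
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            width += 2                 # wide or fullwidth: 2 cells
        else:
            width += 1
    return width
```

Both the terminal emulator and the program running in it could compute
this without ever consulting the font.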

/Kent K

PS
No, I have not done extensive testing of various terminal emulators
on how well they handle the stuff above.





Re: Encoding italic

2019-01-30 Thread Kent Karlsson via Unicode
I did say "multiple" and "for instance". But since you ask:

ITU T.416/ISO/IEC 8613-6 defines general RGB & CMY(K) colour control
sequences, which are deferred in ECMA-48/ISO 6429. (The RGB one
is implemented in Cygwin (sorry for mentioning a product name).)
(The "named" ones, though very popular in terminal emulators, are
all much too stark, I think, and the exact colours for them are
implementation defined.)

ECMA-48/ISO 6429 defines control sequences for CJK emphasising, which
traditionally does not use bold or italic. Compare those specified for CSS
(https://www.w3.org/TR/css-text-decor-3/#propdef-text-decoration-style and
https://www.w3.org/TR/css-text-decor-3/#propdef-text-emphasis-style).
These are not at all mentioned in ITU T.416/ISO/IEC 8613-6, but should
be of interest for the generalised subject of this thread.

There are some other differences as well, but those are the major ones
with regard to text styling. (I don't know those standards to a tee.
I've just looked at the "m" control sequences for text styling. And yes,
I looked at the free copies...)

/Kent Karlsson

PS
If people insist that EACH character in "plain text" italic/bold/etc
"controls" be default ignorable: one could just take the control sequences
as specified, but map the printable characters part to the corresponding
tag characters... Not that I think that that is really necessary.
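
For what it is worth, that mapping is a fixed offset, since the tag
characters U+E0020..U+E007E mirror ASCII 0x20..0x7E. A short Python sketch
follows; the function names to_tag_form and from_tag_form are invented
here, and the functions are meant to be applied to a single control
sequence, not to a whole text.

```python
TAG_BASE = 0xE0000  # tag characters U+E0020..U+E007E mirror ASCII 0x20..0x7E

def to_tag_form(seq):
    """Map the printable ASCII part of one control sequence to tag characters."""
    return "".join(
        chr(TAG_BASE + ord(c)) if 0x20 <= ord(c) <= 0x7E else c
        for c in seq
    )

def from_tag_form(seq):
    """Inverse mapping: tag characters back to printable ASCII."""
    return "".join(
        chr(ord(c) - TAG_BASE) if 0xE0020 <= ord(c) <= 0xE007E else c
        for c in seq
    )
```

ESC itself is left untouched, so "ESC [3m" becomes ESC followed by the tag
characters for "[", "3" and "m".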


On 2019-01-30 22:24, "Doug Ewell via Unicode" wrote:

> Kent Karlsson wrote:
>  
>> Yes, great. But as I've said, we've ALREADY got a
>> default-ignorable-in-display (if implemented right)
>> way of doing such things.
>> 
>> And not only do we already have one, but it is also
>> standardised in multiple standards from different
>> standards institutions. See for instance "ISO/IEC 8613-6,
>> Information technology --- Open Document Architecture (ODA)
>> and Interchange Format: Character content architecture".
>  
> I looked at ITU T.416, which I believe is equivalent to ISO 8613-6 but
> has the advantage of not costing me USD 179, and it looks very similar
> to ISO 6429 (ECMA-48, formerly ANSI X3.64) with regard to the things we
> are talking about: setting text display properties such as bold and
> italics by means of escape sequences.
>  
> Can you explain how ISO 8613-6 differs from ISO 6429 for what we are
> doing, and if it does not, why we should not simply refer to the more
> familiar 6429?
>  
> --
> Doug Ewell | Thornton, CO, US | ewellic.org
> 




Re: Encoding italic

2019-01-29 Thread Kent Karlsson via Unicode
Yes, great. But as I've said, we've ALREADY got a
default-ignorable-in-display (if implemented right)
way of doing such things.

And not only do we already have one, but it is also
standardised in multiple standards from different
standards institutions. See for instance "ISO/IEC 8613-6,
Information technology --- Open Document Architecture (ODA)
and Interchange Format: Character content architecture".
(In a little experiment I found that it seems that
Cygwin is one of the better implementations of this;
B.t.w. I have no relation to Cygwin other than using it.)

To boot, it's been around for decades and is still
alive and well. I see absolutely no need for a "bold"
new concept here; the one below is not better in any
significant way.

/Kent Karlsson


On 2019-01-29 23:35, "Andrew West via Unicode" wrote:

> On Mon, 28 Jan 2019 at 01:55, James Kass via Unicode
>  wrote:
>> 
>> This bold new concept was not mine.  When I tested it
>> here, I was using the tag encoding recommended by the developer.
> 
> Congratulations James, you've successfully interchanged tag-styled
> plain text over the internet with no adverse side effects. I copied
> your email into BabelPad and your "bold" is shown bold (see attached
> screenshot).
> 
> Andrew





Re: Encoding italic

2019-01-28 Thread Kent Karlsson via Unicode


On 2019-01-28 02:53, "James Kass via Unicode" wrote:

> plain-text and are uncomfortable using the math alphanumerics for this,
> although the math alphanumerics seem well qualified for the purpose. 

It "works" basically only for English (note that any diacritics would be
placed suitably for math, not for words, and then there are Latin letters
that do not have a decomposition (like ø), and then there is of course
Cyrillic, and a whole slew of non-Latin scripts). So, no, they do NOT AT
ALL "seem well qualified". And... We already have a well-established
standard for doing this kind of thing...

/Kent K





Re: Encoding italic

2019-01-27 Thread Kent Karlsson via Unicode
Apart from the fact that control sequences for (some) styling are
standardised (for decades by now), and the "tag characters" approach is not:

For the control sequences for styling, there is no pretence of nesting,
just setting/unsetting an aspect of styling. For <b> etc. (in tag
characters) there is at least the pretence/appearance of nesting, even
if the interpreter doesn't actually care about nesting (and just interprets
them as set/unset). (In addition, <b> etc. in "real" HTML are
1) disrecommended, and
2) the actual styling comes from a style sheet (and the **default**
one makes <b> stuff bold).)

/Kent K


On 2019-01-27 21:03, "James Kass via Unicode" wrote:

> 
> A new beta of BabelPad has been released which enables input, storing,
> and display of italics, bold, strikethrough, and underline in plain-text
> using the tag characters method described earlier in this thread.  This
> enhancement is described in the release notes linked on this download page:
> 
> http://www.babelstone.co.uk/Software/index.html
> 





Re: Encoding italic (was: A last missing link)

2019-01-24 Thread Kent Karlsson via Unicode


On 2019-01-24 03:21, "Mark E. Shoulson via Unicode" wrote:

> On 1/22/19 6:26 PM, Kent Karlsson via Unicode wrote:
>> Ok. One thing to note is that escape sequences (including control sequences,
>> for those who care to distinguish those) probably should be "default
>> ignorable" for display. Requiring, or even recommending, them to be default
>> ignorable for other processing (like sorting, searching, and other things)
>> may be a tall order. So, for display, (maximal) substrings that match:
>> 
>> \u001B[\u0020-\u002F]*[\u0030-\u007E]|
>> (\u001B'['|\u009B)[\u0030-\u003F]*[\u0020-\u002F]*[\u0040-\u007E]
>> 
>> should be default ignorable (i.e. invisible, but a "show invisibles" mode
>> would show them; not interpreted ones should be kept, even if interpreted
>> ones need not, just (re)generated on save). That is as far as Unicode
>> should go.
> 
> So it isn't just "these characters should be default ignorable", but
> "this regular expression is default ignorable."  This gets back to
> "things that span more than a character" again, only this time the
> "span" isn't the text being styled, it's the annotation to style it. 

True. That is how ECMA/ISO/ANSI escape/control-sequences are designed.
Had they not already been designed, and implemented, but we were to do
a design today, it would surely be done differently; e.g. having
"controls" that consisted only of (individually) "default-ignorable"
characters.

But, and this is the important thing here:

a) The current esc/control-sequences are an accepted standard,
and have been for a long time.

b) This standard is still in very much active use, albeit mostly
by terminal emulators. But the styling stuff need not at all
be limited to terminal emulators.

Since it is an actively and widely used standard, I don't see the
point of trying to design another way of specifying "default
ignorable"-controls for text styling. (HTML, for instance, does not
have "default ignorable" controls, since ALL characters in the
"controls" are printable characters, so one needs a "second level"
for parsing the controls.) True, ignoring or interpreting an
esc/control-sequence requires some processing of substrings, since
some (all but the first) are printable characters. But not that hard.
It has been implemented over and over...

Had this standard been defunct, then there would be an opportunity
to design something different.


> The "bash" shell has special escape-sequences (\[ and \]) to use in
> defining its prompt that tell the system that the text enclosed by them
> is not rendered and should not be counted when it comes to doing

Never heard of. Cannot find any reference mentioning them. Reference?


> cursor-control and line-editing stuff (so you put them around, yep, the
> escape sequences for coloring or boldfacing or whatever that you want in
> your prompt). 


Line editing stuff in bash is done on an internal buffer (there is a library
for doing this, and that library can be used by various other command line
programs; bash does not use the system input line editing). Then that
library tries to show what is in the buffer on the terminal. So, I'm
not sure what you are talking about; bash does NOT (somehow) scrape
the screen (terminal emulator window).


Furthermore, colouring and bold/underline is quite common not only in
prompts, but also in output directed at a terminal from various programs.
(And it works just fine.) Unfortunately cut-and-paste tends to lose
much (or all) of that. (Would be nicer if it got converted to HTML,
RTF, .doc, or whatever is the target format; or just nicely kept if
"plain text" is the target.)

 
> That would seem to be at least simpler than a big ol'
> regexp, but really not that much of an improvement.  It also goes to
> show how things like this require all kinds of special handling,
> even/especially in a "simple" shell prompt (which could make a strong
> case for being "plain text", though, yes, terminal escape codes are a
> thing.)

They are NOT "terminal escape codes". It is just that, for now, it is
just about only terminal emulators that implement esc/control-sequences.
From https://www.ecma-international.org/publications/standards/Ecma-048.htm:
"The control functions are intended to be used embedded in character-coded
data for interchange, in particular with character-imaging devices."
A (plain) text editor is an example of a 'character-imaging device'.
(Yes, the terminology is a bit dated.)

/Kent K

> 
> ~mark





Re: Encoding italic (was: A last missing link)

2019-01-22 Thread Kent Karlsson via Unicode
Ok. One thing to note is that escape sequences (including control sequences,
for those who care to distinguish those) probably should be "default
ignorable" for display. Requiring, or even recommending, them to be default
ignorable for other processing (like sorting, searching, and other things)
may be a tall order. So, for display, (maximal) substrings that match:

\u001B[\u0020-\u002F]*[\u0030-\u007E]|
(\u001B'['|\u009B)[\u0030-\u003F]*[\u0020-\u002F]*[\u0040-\u007E]

should be default ignorable (i.e. invisible, but a "show invisibles" mode
would show them; not interpreted ones should be kept, even if interpreted
ones need not, just (re)generated on save). That is as far as Unicode
should go.
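
A short Python sketch of the "invisible by default, visible in show
invisibles mode" behaviour. The names ESC_SEQ and render are invented for
this illustration, and using U+241B SYMBOL FOR ESCAPE as the visible
stand-in for ESC is just one possible choice. The control-sequence
alternative is listed first so that the maximal substring is matched;
otherwise "ESC [" alone would match the shorter escape-sequence form.

```python
import re

ESC_SEQ = re.compile(
    # control sequence: (ESC [ or U+009B), parameters, intermediates, final
    "(?:\x1b\\[|\x9b)[\x30-\x3f]*[\x20-\x2f]*[\x40-\x7e]"
    # other escape sequence: ESC, intermediates, final
    "|\x1b[\x20-\x2f]*[\x30-\x7e]"
)

def _visible(match):
    # Show ESC as U+241B; show the single-character CSI as U+241B followed by "[".
    return match.group().replace("\x1b", "\u241b").replace("\x9b", "\u241b[")

def render(text, show_invisibles=False):
    """Hide matching sequences, or make them visible in show-invisibles mode."""
    if show_invisibles:
        return ESC_SEQ.sub(_visible, text)
    return ESC_SEQ.sub("", text)
```

Nothing is removed from the stored text in either mode; render only
affects what is displayed.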

Some may be interpreted, this thread focuses on italic, but also bold
and underlined. There is a whole bunch of "style" control sequences
(those that have "m" at the end of the sequence) specified, and terminal
emulators implement several of them, but not all.

As for editing, if "style" control sequences à la ISO 6429 were to be
supported in text editors, I would NOT expect users to type in those
escape/control sequences in any way, but use "ctrl/command-i" (etc.) or
menu commands as editors do now, and the representation as esc-sequences
be kept under wraps (and maybe only present in files, not in the internal
representation during editing), and not seen unless one starts to analyse
the byte sequences in files. So, even if you don't like this esc-sequence
business:
1) It would not be seen by most users, mostly by programmers (the same
goes for other ways of representing this, be it HTML, .doc, or whatever).
2) It is already standardised, and one can make (a slightly inaccurate)
argument that it is "plain text".

What one would need to do is:
1) Prioritise which "style" control sequences should be interpreted
(rather than be ignored).
2) Lobby to "plain" text editor makers to support those styles,
representing them (in files) as standard control sequences.

A selection of already standardised style codes (i.e., for control
sequences that end in "m"):
 
0   default rendition (implementation-defined)

1   bold
(2  lean)
22  normal intensity (neither bold nor lean)

3   italicized
23  not italicized (i.e. upright)

4   singly underlined
(21 doubly underlined)
24  not underlined (neither singly nor doubly)

(9  crossed-out (strikethrough))
(29 not crossed out)

If you really want to go for colour as well (RGB values in 0-255)
(colour is popular in terminal emulators...):
 
(30-37  foreground: black, red, green, yellow, blue, magenta, cyan, white)
38  foreground colour as RGB. Next arguments 2;r;g;b
39  default foreground colour (implementation-defined)

(40-47  background: black, red, green, yellow, blue, magenta, cyan, white)
48  background colour as RGB. Next arguments 2;r;g;b
49  default background colour (implementation-defined)
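
A small Python sketch of the RGB forms of codes 38 and 48, using the
semicolon-separated "2;r;g;b" arguments given above. The function names are
hypothetical, and whether a given terminal emulator or editor honours these
sequences is not guaranteed.

```python
def _rgb_sequence(code: int, r: int, g: int, b: int) -> str:
    """Build ESC [ code ; 2 ; r ; g ; b m after checking the 0-255 range."""
    for value in (r, g, b):
        if not 0 <= value <= 255:
            raise ValueError(f"colour component out of range: {value}")
    return f"\x1b[{code};2;{r};{g};{b}m"


def foreground_rgb(r: int, g: int, b: int) -> str:
    """Control sequence selecting an RGB foreground colour (code 38)."""
    return _rgb_sequence(38, r, g, b)


def background_rgb(r: int, g: int, b: int) -> str:
    """Control sequence selecting an RGB background colour (code 48)."""
    return _rgb_sequence(48, r, g, b)


DEFAULT_FOREGROUND = "\x1b[39m"
DEFAULT_BACKGROUND = "\x1b[49m"
```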

There are some more (including some that assume a small font palette, for
changing font). But far enough for now. Maybe too far already. But do not
allow interpreting multiple style attribute codes in one control sequence;
quite unnecessary.
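
To make the restriction concrete, here is a hedged Python sketch of an
interpreter that honours only single-code style sequences and silently
skips everything else, including multi-code sequences such as "1;3". The
names and the choice of supported styles are assumptions for illustration.

```python
import re

SGR = re.compile(r"\x1b\[([0-9;]*)m")

ON = {"1": "bold", "3": "italic", "4": "underline"}
OFF = {"22": "bold", "23": "italic", "24": "underline"}


def parse_runs(text):
    """Split text into (substring, frozenset of active styles) runs."""
    runs = []
    active = set()
    pos = 0
    for match in SGR.finditer(text):
        if match.start() > pos:
            runs.append((text[pos:match.start()], frozenset(active)))
        code = match.group(1)
        if code in ("", "0"):
            active.clear()              # default rendition
        elif code in ON:
            active.add(ON[code])
        elif code in OFF:
            active.discard(OFF[code])
        # Any other parameter string (e.g. "1;3") is not interpreted.
        pos = match.end()
    if pos < len(text):
        runs.append((text[pos:], frozenset(active)))
    return runs
```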


/Kent K



On 2019-01-21 21:46, "Doug Ewell via Unicode" wrote:

> Kent Karlsson wrote:
> 
>> There is already a standardised, "character level" (well, it is from
>> a character standard, though a more modern view would be that it is
>> a higher level protocol) way of specifying italics (and bold, and
>> underline, and more):
>> 
>> \u001b[3mbla bla bla\u001b[0m
>> 
>> Terminal emulators implement some such escape sequences.
> 
> And indeed, the forthcoming Unicode Technical Note we are going to be
> writing to supplement the introduction of the characters in L2/19-025,
> whether next year or later, will recommend ISO 6429 sequences like this
> to implement features like background and foreground colors, inverse
> video, and more, which are not available as plain-text characters.
>  
> --
> Doug Ewell | Thornton, CO, US | ewellic.org
> 





Re: Encoding italic (was: A last missing link)

2019-01-19 Thread Kent Karlsson via Unicode
(I have skipped some messages in this thread, so maybe the following
has been pointed out already. Apologies for this message if so.)

You will not like this... But...

There is already a standardised, "character level" (well, it is from
a character standard, though a more modern view would be that it is
a higher level protocol) way of specifying italics (and bold, and
underline, and more):

\u001b[3mbla bla bla\u001b[0m

Terminal emulators implement some such escape sequences. The terminal
emulators I use support bold (1 after the [) but not italic (3). Every time
you use the "man" command in a Linux/Unix/similar terminal you "use" the
escape sequences for bold and underline... Other terminal based programs
often use bold as well as colour esc-sequences for emphasis as well as for
warning/error messages, and other "hints" of various kinds. For xterm,
see: https://www.xfree86.org/4.8.0/ctlseqs.html.

So I don't see these esc-sequences becoming obsolete any time soon.
But I don't foresee them being supported outside of terminal emulators
either... (Though for style esc-sequences it would certainly be possible.
And a "smart" cut-and-paste operation could auto-insert an esc-sequence
that sets the style after the paste to the one before the paste...)
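
A rough Python sketch of such a "smart" paste, under several assumptions:
only codes 1, 3 and 4 (and their "off" codes) are tracked, the insertion
index never falls inside a control sequence, and all names are hypothetical.
After the pasted text it emits a reset followed by the styles that were
active at the insertion point.

```python
import re

SGR = re.compile(r"\x1b\[([0-9]*)m")
ON = {"1", "3", "4"}                       # bold, italic, underline
OFF = {"22": "1", "23": "3", "24": "4"}    # "off" code -> the "on" code it cancels


def active_codes(text):
    """Return the style codes still in effect at the end of text."""
    active = []
    for match in SGR.finditer(text):
        code = match.group(1) or "0"
        if code == "0":
            active.clear()
        elif code in ON and code not in active:
            active.append(code)
        elif code in OFF and OFF[code] in active:
            active.remove(OFF[code])
    return active


def smart_paste(document, index, clip):
    """Insert clip at index, then restore the style that held before the paste."""
    before = active_codes(document[:index])
    restore = "\x1b[0m" + "".join(f"\x1b[{code}m" for code in before)
    return document[:index] + clip + restore + document[index:]
```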

Had HTML (somehow, magically) been invented before terminals, maybe
terminals (terminal emulators) would have used some kind of "mini-HTML"
instead. But things are like they are on that point.

/Kent Karlsson

PS
The cut-and-paste I used here converts (imperfectly: bold is lost and
spurious ! inserted) to HTML (surely going through some internal
attribute-based representation, the HTML being generated when I press send):

man(1) 
man(1)

NAME
   man - format and display the on-line manual pages

SYNOPSIS
   man  [-acdfFhkKtwW]  [--path]  [-m system] [-p string] [-C config_file]
   [-M pathlist] [-P pager] [-B browser] [-H htmlpager] [-S section_list]
   [section] name ...






On 2019-01-18 20:18, "Asmus Freytag via Unicode" wrote:

>
> 
> I would fully agree and I think Mark puts it really well in the message below
> why some of the proposals brandished here are no longer plain text but
> "not-so-plain" text.
>  
> 
> I think we are better served with a solution that provides some form of
> "light" rich text, for basic emphasis in short messages. The proper way for
> this would be some form of MarkDown standard shared across vendors, and
> perhaps implemented in a way that users don't necessarily need to type
> anything special, but that, if exported to "true" plain text, it turns into
> the source format for the "light" rich text.
>  
> 
> This is an effort that's out of scope for Unicode to implement, or, I should
> say, if the Consortium were to take it on, it would be a separate technical
> standard from The Unicode Standard.
>  
>  
> 
> A./
>  
> 
> PS: I really hate the creeping expansion of pseudo-encoding via VS characters.
> The only worse thing is adding novel control functions.
>  
>  
> 
>  
>  
> On 1/18/2019 7:51 AM, Mark E. Shoulson via Unicode wrote:
>  
>  
>> On 1/16/19 6:23 AM, Victor Gaultney via Unicode wrote:
>>  
>>>  
>>>  Encoding 'begin italic' and 'end italic' would introduce difficulties when
>>> partial strings are moved, etc. But that's no different than with current
>>> punctuation. If you select the second half of a string that includes an end
>>> quote character you end up with a mismatched pair, with the same problems of
>>> interpretation as selecting the second half of a string including an 'end
>>> italic' character. Apps have to deal with it, and do, as in code editors.
>>>  
>>>  
>>  It kinda IS different.  If you paste in half a string, you get a mismatched
>> or unmatched paren or quote or something.  A typo, but a transient one.  It
>> looks bad where it is, but everything else is unaffected.  It's no worse than
>> hitting an extra key by mistake. If you paste in a "begin italic" and miss
>> the "end italic", though, then *all* your text from that point on is
>> affected!  (Or maybe "all until a newline" or some other stopgap ending, but
>> that's just damage-control, not damage-prevention.)  Suddenly, letters and
>> symbols five words/lines/paragraphs/pages later look different, the pagination is
>> all altered (by far more than merely a single extra punctuation mark, since
>> italic fonts generally are narrower than roman).  It's a disaster.
>>  
>>  No.  This kind of statefulness really is beyond what Unicode is designed to
>> cope with.  Bidi controls are (almost?) the sole exception, and even they
>> cause their share of headaches.  Encoding separate _text_ italics/bold is IMO
>> also a disastrous idea, but I'm not putting out reasons for that now.  The
>> only really feasible suggestion I've heard is using a VS in some fashion.
>> (Maybe let it affect whole words instead of individual characters?  Makes for
>> fewer noisy VSs, but introduces a whole other host of limitations (how 

Re: Unicode 11 Georgian uppercase vs. fonts

2018-07-28 Thread Kent Karlsson via Unicode
I know it is too late now, but... Could have added the characters,
without adding the case mappings. Just as it was done for the LATIN
CAPITAL LETTER SHARP S (ẞ), where the proper case mapping was relegated
to "special purpose software" (or just a special setting in common
software). The (proper) case-mapping for ẞ is nowhere to be found in the
Unicode database (which I think is a pity, but that is a different matter).

I think "specialcasing.txt" is not really maintained anymore, but I'll
disregard that here.

One could add a special-casing for each modern Georgian lowercase letter
to (continue to) uppercase-map to itself (for the Georgian language at
least).
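
As a sketch of what such a special-casing would amount to in software, the
Python function below uppercases text but leaves modern Georgian (Mkhedruli)
letters unchanged. The function name and the choice of code points
(U+10D0..U+10FA and U+10FD..U+10FF) are assumptions for illustration; this
is not how any particular library implements tailored casing.

```python
# Mkhedruli letters that have Mtavruli uppercase mappings in Unicode 11
# (assumed set: U+10D0..U+10FA and U+10FD..U+10FF).
MKHEDRULI = set(range(0x10D0, 0x10FB)) | {0x10FD, 0x10FE, 0x10FF}


def upper_keep_mkhedruli(text: str) -> str:
    """Uppercase text, but map each Mkhedruli letter to itself."""
    return "".join(ch if ord(ch) in MKHEDRULI else ch.upper() for ch in text)
```

For comparison, Python's default mappings give "SS" for "ß".upper() and "ß"
for "ẞ".lower(); there is no default mapping from ß up to ẞ.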

/Kent K



On 2018-07-28 15:26, "Michael Everson via Unicode" wrote:

> Mtavruli could not be represented in the UCS before we added these characters.
> Now it can. 
> 
> Michael Everson
> 
>> On 28 Jul 2018, at 14:10, Richard Wordingham via Unicode
>>  wrote:
>> 
>> On Sat, 28 Jul 2018 01:45:53 +
>> Peter Constable via Unicode  wrote:
>> 
>>> (iii) gave
>>> indication of intent to develop a plan of action for preparing their
>>> institutions for this change as well as communicating that within
>>> Georgian industry and society. It was only after that did UTC feel it
>>> was viable to proceed with encoding Mtavruli characters.
>> 
>> It is dangerous to rely on declarations of intent when making
>> irreversible decisions.  The UTC should have learnt that from the
>> Mongolian mess.
>> 
>> Richard.
> 
> 





Re: Proposal to add standardized variation sequences for chess notation

2017-04-12 Thread Kent Karlsson via Unicode

On 2017-04-12 06:12, "Garth Wallace" wrote:

> Shogi diagrams are uncheckered (as Shogi boards are), with grid-lines to
> separate the spaces; traditionally, chess diagrams use the contrast of dark
> and light squares to distinguish spaces with no grid lines. They may, but do
> not have to, have dots at some intersections (these mark starting and
> promotion zones). Graphical diagrams may show images of pieces (pentagonal,
> with names written in kanji), but typeset diagrams use abbreviations of the
> piece names as CJK ideographs or kana: e.g. the gold general is 金, and the
> promoted pawn is と. Instead of "black" and "white", the pieces belonging to
> the sente player are displayed upright and those belonging to the gote player
> are rotated 180°. Any proposal for Shogi would have to deal with that.

OT

Unicode has (only) these for Shogi pieces:

2616;WHITE SHOGI PIECE;So;0;ON;N;
2617;BLACK SHOGI PIECE;So;0;ON;N;
26C9;TURNED WHITE SHOGI PIECE;So;0;ON;N;
26CA;TURNED BLACK SHOGI PIECE;So;0;ON;N;

Which seems insufficient...
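
The four entries can be checked against the character database that ships
with Python; a small sketch (the function name is hypothetical):

```python
import unicodedata

SHOGI_CODE_POINTS = (0x2616, 0x2617, 0x26C9, 0x26CA)


def shogi_piece_names():
    """Return (code point label, character name) pairs for the shogi piece symbols."""
    return [(f"U+{cp:04X}", unicodedata.name(chr(cp))) for cp in SHOGI_CODE_POINTS]


if __name__ == "__main__":
    for label, name in shogi_piece_names():
        print(label, name)
```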

/Kent K



Re: Proposal to add standardized variation sequences for chess notation

2017-04-12 Thread Kent Karlsson via Unicode

On 2017-04-12 05:14, "Garth Wallace" wrote:

> One salient feature the Block Elements have that the Box Drawing characters do
> not: distinct LEFT and RIGHT verticals, and LOWER and UPPER horizontals. The
> double frame typically consists of a thin line and a thicker line, with one on
> the inside and one on the outside, so left and right verticals are not
> interchangeable. Even when a single frame is used, it is important for
> spacing, since the frame should be flush against the board.

Note that I used TWO DIFFERENT variation selectors for the horizontal and
vertical box drawing characters in my suggestion (marked in bold here):

2500 FE00; Chessboard box drawing (top); # BOX DRAWINGS LIGHT HORIZONTAL (U+2500)
2500 FE01; Chessboard box drawing (bottom); # BOX DRAWINGS LIGHT HORIZONTAL (U+2500)
2502 FE00; Chessboard box drawing (left); # BOX DRAWINGS LIGHT VERTICAL (U+2502)
2502 FE01; Chessboard box drawing (right); # BOX DRAWINGS LIGHT VERTICAL (U+2502)
250C FE00; Chessboard box drawing; # BOX DRAWINGS LIGHT DOWN AND RIGHT (U+250C)
2510 FE00; Chessboard box drawing; # BOX DRAWINGS LIGHT DOWN AND LEFT (U+2510)
2514 FE00; Chessboard box drawing; # BOX DRAWINGS LIGHT UP AND RIGHT (U+2514)
2518 FE00; Chessboard box drawing; # BOX DRAWINGS LIGHT UP AND LEFT (U+2518)

2550 FE00; Chessboard box drawing (top); # BOX DRAWINGS DOUBLE HORIZONTAL (U+2550)
2550 FE01; Chessboard box drawing (bottom); # BOX DRAWINGS DOUBLE HORIZONTAL (U+2550)
2551 FE00; Chessboard box drawing (left); # BOX DRAWINGS DOUBLE VERTICAL (U+2551)
2551 FE01; Chessboard box drawing (right); # BOX DRAWINGS DOUBLE VERTICAL (U+2551)
2554 FE00; Chessboard box drawing; # BOX DRAWINGS DOUBLE DOWN AND RIGHT (U+2554)
2557 FE00; Chessboard box drawing; # BOX DRAWINGS DOUBLE DOWN AND LEFT (U+2557)
255A FE00; Chessboard box drawing; # BOX DRAWINGS DOUBLE UP AND RIGHT (U+255A)
255D FE00; Chessboard box drawing; # BOX DRAWINGS DOUBLE UP AND LEFT (U+255D)
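
A Python sketch of how text using the light (single-line) sequences above
could be assembled. These variation sequences are only a suggestion in this
thread, not standardized, so no font should be expected to render them
specially; the function name is hypothetical.

```python
VS1 = "\ufe00"  # VARIATION SELECTOR-1
VS2 = "\ufe01"  # VARIATION SELECTOR-2


def frame_board(rows):
    """Wrap equal-length board rows in a single-line frame using the suggested sequences."""
    width = len(rows[0])
    # Top edge uses VS1 on the horizontal, bottom edge uses VS2.
    top = "\u250c" + VS1 + ("\u2500" + VS1) * width + "\u2510" + VS1
    bottom = "\u2514" + VS1 + ("\u2500" + VS2) * width + "\u2518" + VS1
    # Left edge uses VS1 on the vertical, right edge uses VS2.
    middle = ["\u2502" + VS1 + row + "\u2502" + VS2 for row in rows]
    return "\n".join([top, *middle, bottom])
```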

/Kent K



Re: Proposal to add standardized variation sequences for chess notation

2017-04-11 Thread Kent Karlsson via Unicode

On 2017-04-10 12:19, "Michael Everson" wrote:

> I believe the box drawing characters are for drawing boxes

Which is exactly what you are doing.

> and grids on 
> computer terminals, which is not the same thing as scoring a line around a set
> of 64 graphic images.

No, that is why I put in variation selectors. The glyphic variation
selected would in my judgement fall well within the "box drawing semantics"
(if you like) of these characters.

In addition, thinking ahead, it is not at all unlikely that someone
might want to divide a chess board with a horizontal mid-line, or for
that matter a vertical mid-line (e.g. for "double chess"), or even
quadrants. And then, ta-da, there are already box-drawing characters for
doing just that (even when there is a small gap between the board and the
border). (I'm not suggesting to add variation selector sequences for /those/
box drawing characters, because I don't /know/ there is a use-case for
mid-lines in chess board layout, but I'm saying there might be.)

> I don't want to get mixed up in using the box-drawing
> characters. The characters which I have chosen work fine and to my mind suit
> the application better.

They "work" (of course); no font renderer or font editor is "smart" enough
to "see" that you are going quite a bit (in my judgement) outside of the
acceptable glyph variability for the characters you (so far) opted for
for chess box drawing. (Other relevant, and non-glyph, properties being
the same between the box drawing and block chars.)

That the "block characters" are pure crap (which they are), does not
mean that you can co-opt them for (slightly) "variant" box drawing.

> I also don't want to complicate chess fonts by having to have multiple choices
> within a font for bordering. For one thing, single-rule and double-rule
> bordering is by no means the gamut of possibility.

You are not wanting "emoji" style borders, I'm sure. But some slight
"ornate" style would be fine for the "box drawing" chars (even without
variation selectors). The "single" should still be single, though,
and the "double" be double. So triple (etc.) is out.

I think single/double line border should be a decision by the "author"/
"editor", and not the font maker. Imagine accompanying text saying
"the double bordered one is ".

> Chess fonts do not have to be swiss-army knives.

I don't see that I have asked for that.

B.t.w., I see you don't have 1-8, a-h labels on the boards... It might be
worth mentioning that FULLWIDTH a-h should work fine as labels (them being
em-wide).
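
A short Python sketch for producing such labels. FULLWIDTH LATIN SMALL
LETTER A is U+FF41, so the file labels a-h are eight consecutive code
points; the function name is hypothetical.

```python
import unicodedata

FULLWIDTH_SMALL_A = 0xFF41  # FULLWIDTH LATIN SMALL LETTER A


def fullwidth_file_labels() -> str:
    """Return the fullwidth letters a-h for labelling chess board files."""
    return "".join(chr(FULLWIDTH_SMALL_A + i) for i in range(8))


if __name__ == "__main__":
    for ch in fullwidth_file_labels():
        print(ch, unicodedata.name(ch))
```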

/Kent K