Re: Encoding italic

2019-02-05 Thread Richard Wordingham via Unicode
On Tue, 5 Feb 2019 16:01:41 +
Andrew West via Unicode  wrote:

> You would
> have to first convert any text to be italicized to NFD, then apply
> VS14 to each non-combining character. This alone would make a VS
> solution unacceptable in my opinion.

What is so unacceptable about having to do this?

Richard.


Re: Bidi paragraph direction in terminal emulators (was: Proposal for BiDi in terminal emulators)

2019-02-05 Thread Philippe Verdy via Unicode
I think that before making any decision we must settle what we mean by
"newlines". There are in fact 3 different functions:
- (1) soft line breaks (which are used to enforce a maximum display width
between paragraph margins): these are equivalent to breakable and
compressible whitespace, and do not change the logical paragraph
direction; they don't insert any additional vertical gap between lines, so
the logical line-height is preserved and continues uninterrupted. If text
justification applies, this whitespace will be entirely collapsed into the
end margin, and any text before it will still be justified to match the
end margin (until the maximum expansion of other whitespace in the middle
is reached and the maximum intercharacter gap is also reached, in which
case that line will not be expanded any further). This does not apply
to terminal emulators, which normally never use text justification, so the
text will just be aligned to the start margin; whitespace before the break
on the same line is preserved, and collapsed only at the end of the line
(just before the soft line break itself).
- (2) hard line breaks: they break to a new line but continue the paragraph
in its same logical direction; they are not compressible
whitespace (and do not depend on the logical end margin of the paragraph).
- (3) paragraph breaks: generally they introduce an additional vertical gap
with top and bottom margins.

The problem in terminals is that they usually cannot distinguish types (1)
and (2): both are simply encoded by a single CR, or LF, or CR+LF, or NEL.
Type (1) exists only within the framework of a higher-level protocol
that gives additional interpretation to these "newlines". The special
control LS is almost never used but may be used for type (1), i.e. soft
line breaks, and will fall back to type (2), which is represented by the
legacy "simple" newlines (single CR, or single LF, or single CR+LF, or
single NEL). I have seen very little or no use of the LS (line separator)
special control.

Type (3) may be encoded with PS (paragraph separator), but in terminals
(and common protocols like MIME) it is usually encoded using a pair of
newlines (CR+CR, or LF+LF, or CR+LF+CR+LF, or NL+NL), possibly with
additional whitespace (and additional presentation characters such as ">"
in quotations inserted in mail responses) between them (needed for MIME and
HTTP), which may be collapsed when rendering or interpreting them.

Some terminal protocols can also use other legacy ASCII separators such as
FS, GS, RS, US for grouping units containing multiple paragraphs, or
STX/EOT pairs for encapsulating whole text documents in a
protocol-specific envelope format (and will also use some escaping
mechanism for special controls found in the middle, such as DLE+control to
escape the control, or DLE+0 to escape a NUL, or DLE+# to escape a DEL, or
DEL+x+NN where N are a fixed number of hexadecimal, decimal or octal
digits). There's a wide variety of escaping mechanisms used by various
higher-layer protocols (including transport protocols or encoding syntaxes
used just below the plain-text layer, in a lower layer than the transport
protocol layer).

On Mon, 4 Feb 2019 at 21:46, Eli Zaretskii via Unicode
wrote:

> > Date: Mon, 4 Feb 2019 19:45:13 +
> > From: Richard Wordingham via Unicode 
> >
> > Yes.  If one has a text composed of LTR and RTL paragraphs, one has to
> > choose how far apart their starting margins are.  I think that could
> > get complicated for plain text if the terminal has unbounded width.
>
> But no real-life terminal does.  The width is always bounded.
>


Re: Bidi paragraph direction in terminal emulators

2019-02-05 Thread Eli Zaretskii via Unicode
> From: Egmont Koblinger 
> Date: Tue, 5 Feb 2019 02:28:50 +0100
> Cc: unicode@unicode.org
> 
> I have to admit, I'm not an Emacs user, I only have some vague ideas
> how powerful a tool it is. But in its very core I still believe it's a
> text editor – is it fair to say this? It could be used for example to
> conveniently create TUTORIAL.he.

It is a text editing/processing environment which has a lot of
text-based applications built on top of it.  It could be (and was) used
to create TUTORIAL.he, but it can be and is used for much more.

> There are plenty of line-oriented tools.
> [...]

Actually, for every utility you mention, Emacs has a command that
either invokes the utility and presents its output, or does the same
job by using built-in features.  So most/all of the jobs you mention
are routinely done in Emacs.  After all, Emacs is a programmer's
editor at its core, so every job programmers routinely do from the
shell prompt has an equivalent feature in Emacs.  You can even run
shells inside Emacs, with Emacs serving as a terminal emulator (which
then supports bidi ;-).

> There are just sooo many use cases, it's impossible to perfectly
> address all of them at once.

I don't think you need to look for a perfect solution.  You need to
look for one that works reasonably well in practice.  It is my
experience in Emacs that the empty line as paragraph delimiter
produces much better results than if you treat each line as a separate
paragraph.  Emacs does have features that allow overriding the
default paragraph direction, but experience shows that they are used
relatively rarely.

> I'm confident that my specification which says that it should be
> preserved as a 100 character long paragraph and passed to BiDi
> accordingly is already a significant step forward.

I agree, but I urge you to make one more step, which IME is really
necessary.


Re: Encoding italic

2019-02-05 Thread Andrew West via Unicode
On Tue, 5 Feb 2019 at 15:34, wjgo_10...@btinternet.com via Unicode
 wrote:
>
> Well, a proposal just about using VS14 to indicate a request for an
> italic version of a glyph in plain text, including a suggestion of to
> which characters it could apply, would test whether such a proposal
> would be accepted to go into the Document Register for the Unicode
> Technical Committee to consider or just be deemed out of scope and
> rejected and not considered by the Unicode Technical Committee.

Just reminding you that "The initial character in a variation sequence
is never a nonspacing combining mark (gc=Mn) or a canonical
decomposable character" (The Unicode Standard 11.0 §23.4). This means
that a variation sequence cannot be defined for any precomposed
letters and diacritics, so for example you could not italicize the
word "fête" by simply adding VS14 after each letter because "ê" (in
NFC form) cannot act as the base for a variation sequence. You would
have to first convert any text to be italicized to NFD, then apply
VS14 to each non-combining character. This alone would make a VS
solution unacceptable in my opinion.
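
The two steps described above can be sketched as follows. This is only
an illustration, not a defined mechanism: no italic variation sequences
exist in Unicode, and the function name italicize_with_vs14 and the
choice to treat every character whose General Category starts with "M"
as "combining" are assumptions made for the example. VS14 itself is
U+FE0D.

```python
import unicodedata

VS14 = "\uFE0D"  # VARIATION SELECTOR-14


def italicize_with_vs14(text: str) -> str:
    """Hypothetical sketch: decompose to NFD, then put VS14 after each base.

    Assumption: a character counts as "combining" when its General
    Category is Mn, Mc or Me; everything else receives a VS14.
    """
    decomposed = unicodedata.normalize("NFD", text)
    out = []
    for ch in decomposed:
        out.append(ch)
        if not unicodedata.category(ch).startswith("M"):
            # The selector has to follow the base directly, so it ends up
            # between the base letter and any combining marks.
            out.append(VS14)
    return "".join(out)


# "fête" in NFC has 4 code points; after NFD it has 5 (e + U+0302).
marked = italicize_with_vs14("f\u00eate")
print([f"U+{ord(c):04X}" for c in marked])
```

In the output the selector sits between "e" and the combining
circumflex, which is the cost being pointed out: text stored in NFC has
to be decomposed before it can be marked up this way.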

Andrew



Re: Bidi paragraph direction in terminal emulators

2019-02-05 Thread Eli Zaretskii via Unicode
> From: Egmont Koblinger 
> Date: Tue, 5 Feb 2019 01:32:34 +0100
> Cc: unicode@unicode.org
> 
> On the other hand, it's not unreasonable for higher level stuff (e.g.
> shell scripts, or tools like "zip") to use such control characters.

Yes, but most of them won't ever do that.

> > No, this simple case must work reasonably well with the application
> > _completely_ oblivious to the bidi aspects. If this can't work
> > reasonably well, I submit that the entire concept of having a
> > bidi-aware terminal emulator doesn't "hold water".
> 
> There isn't a magic wand. I can't magically fix every BiDi stuff by
> changing the terminal emulator's source code.

I didn't say "magically fix", I said "work reasonably well".  I think
it would be a mistake to demand that any alternative to the default
each-line-is-a-new-paragraph method must be perfect.  It should be
enough if an alternative is better.

> What my specification essentially modifies is that with this
> specification, you at least will have a chance to get the mode right.

My experience is that this is an important feature to have, but it
will (maybe even should) be used rather rarely.  In most cases you
will just have plain text.

Moreover, emitting the control sequences that set the mode is in
itself a complication, because if the terminal doesn't support them,
the result could be corrupted display.  You will need methods of
detecting the support, and those detection methods usually involve
sending another control sequence to the terminal and waiting for
response, something that complicates applications and causes delays in
displaying output.

> In case of "zip", the creators of that software know exactly how the
> output should look like

Not necessarily true.  The translations are normally prepared by
people who are experts only in translating messages, they don't
necessarily consider layout issues, because for that you'd need to
look at the code or even run the program, something translators are
unlikely to do.

> If you're about to internationalize your software, this layout is a
> pretty bad choice.

Tell me about that!

But the reality is that this is what you get, and IMO the solution for
displaying this on a terminal should work reasonably well with that.

> This kind of formatting also ignores that English is a pretty dense
> language, in other languages the strings tend to become longer.

Actually, some/many RTL scripts tend to produce shorter text, because
vowels are not written, and because many words have very short roots.
But this is a tangent.


Re: Bidi paragraph direction in terminal emulators (was: Proposal for BiDi in terminal emulators)

2019-02-05 Thread Eli Zaretskii via Unicode
> Date: Tue, 5 Feb 2019 00:05:47 +
> From: Richard Wordingham via Unicode 
> 
> > > Actually, UAX#9 defines "paragraph" as the chunk of text delimited
> > > by paragraph separator characters. This means characters whose bidi
> > > category is B, which includes Newline, the CR-LF pair on Windows,
> > > U+0085 NEL, and U+2029 PARAGRAPH SEPARATOR.
> 
> It actually gives two different definitions. Table 4 of UAX#9 restricts
> the type B to *appropriate* newline functions; not all newlines are
> paragraph separators.

For what exactly an "appropriate newline function" is, one should read
the Unicode Standard, section 5.8.  My conclusions from that are different
from yours; see below.

> > Indeed, this was an oversight on my side. So, with this definition,
> > every single newline character starts a new paragraph. The result of
> > printf "Hello\nWorld\n" > world.txt
> > is a text file consisting of two paragraphs, with 5 characters in
> > each. Correct?
> 
> No, it depends on when a newline function is 'appropriate'. TUS 5.8
> Rule R2b applies - 'In simple text editors, interpret any NLF the same
> as LS'.

That's not all of what the Standard says.  Just a couple of paragraphs
above Rule R2b, there's this text:

  Note that even if an implementer knows which characters represent
  NLF on a particular platform, CR, LF, CRLF, and NEL should be
  treated the same on input and in interpretation. Only on output is
  it necessary to distinguish between them.

So in practice, IMO the above example does constitute 2 paragraphs,
regardless of the underlying platform's conventions.
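
A minimal sketch of that reading, treating CR, LF, CRLF and NEL alike
on input and also accepting U+2029 as a separator. The function name
split_paragraphs is made up for the example, and the sketch deliberately
ignores the competing R2b reading (NLF as line separator) discussed
above.

```python
import re

# CRLF is listed first so that it is consumed as one separator, not two.
PARAGRAPH_SEPARATOR = re.compile("\r\n|[\n\r\x85\u2029]")


def split_paragraphs(text: str) -> list[str]:
    """Split text at every newline function, each one ending a paragraph."""
    parts = PARAGRAPH_SEPARATOR.split(text)
    # A trailing separator terminates the last paragraph; it does not
    # start an extra empty one.
    if parts and parts[-1] == "":
        parts.pop()
    return parts


# Same content as: printf "Hello\nWorld\n" > world.txt
paragraphs = split_paragraphs("Hello\nWorld\n")
print(paragraphs, [len(p) for p in paragraphs])
```

With this interpretation the file has two paragraphs of 5 characters
each, and the result is the same if the LF is replaced by CR+LF or NEL.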


Re: Bidi paragraph direction in terminal emulators (was: Proposal for BiDi in terminal emulators)

2019-02-05 Thread Eli Zaretskii via Unicode
> From: Egmont Koblinger 
> Date: Tue, 5 Feb 2019 00:08:10 +0100
> Cc: unicode@unicode.org
> 
> every single newline character starts a new paragraph. The result of
> printf "Hello\nWorld\n" > world.txt
> is a text file consisting of two paragraphs, with 5 characters in each. 
> Correct?

Yes.

> > Actually, Emacs implements the rule that paragraphs are separated by
> > empty lines. This is documented in the Emacs manuals.
> 
> That is, Emacs overrides UAX#9 and comes up with a different
> definition?

Yes, Emacs uses the "higher-level protocols" clause in HL1, when the
paragraph direction is to be determined from the text.  (There's also
a way for the user or a Lisp program to force a certain base paragraph
direction on all paragraphs in a window that displays some text.)

> Furthermore, you argue that in terminals I should follow
> Emacs's definition rather than Unicode's?

IME, what Emacs uses gives much better results, yes.

> I believe I understand your concerns with the per-line paragraph
> definition, but this interpretation that I've just shown most likely
> leads to even more broken behavior.

I don't see how the result could be more broken, when the decisions
about base paragraph direction are made much more rarely.  The places
in text where the paragraph direction will be determined under my
proposal is a small subset of the places where it will be determined
by the default UBA rules.  So it will make the same mistakes as the
each-line-is-a-new-paragraph method, but there will be far fewer
such mistakes.
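
To make the comparison concrete, here is a rough sketch of the two
segmentation rules. It is not Emacs's code and not a UBA implementation:
base_direction only looks for the first strong character (in the spirit
of rules P2/P3 of UAX#9, ignoring isolates), and the names
directions_per_line and directions_per_block are invented for the
example.

```python
import re
import unicodedata


def base_direction(paragraph: str) -> str:
    """Simplified first-strong heuristic; defaults to LTR."""
    for ch in paragraph:
        bidi = unicodedata.bidirectional(ch)
        if bidi == "L":
            return "LTR"
        if bidi in ("R", "AL"):
            return "RTL"
    return "LTR"


def directions_per_line(text: str) -> list[str]:
    """Default rule: every non-empty line is its own paragraph."""
    return [base_direction(line) for line in text.split("\n") if line]


def directions_per_block(text: str) -> list[str]:
    """Alternative rule: paragraphs are separated by empty lines."""
    blocks = re.split(r"\n[ \t]*\n", text)
    return [base_direction(block) for block in blocks if block.strip()]


HEBREW = "\u05e9\u05dc\u05d5\u05dd"  # a Hebrew word, written with escapes
# A two-line Hebrew paragraph whose second line happens to start with a
# Latin word, followed by an empty line and an English paragraph.
sample = f"{HEBREW} {HEBREW}\nEmacs {HEBREW}\n\nplain English\n"

print(directions_per_line(sample))   # three decisions, one per line
print(directions_per_block(sample))  # two decisions, one per block
```

Per line, the second line of the Hebrew paragraph is laid out LTR
because it begins with "Emacs"; per block, the decision is taken once
for the whole paragraph, so that line stays RTL.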

In addition to this theoretical argument, I have 10 years of using
this in Emacs to back me up.  The only difference between Emacs and
your example is the very first paragraph.

> It's a really nontrivial technical problem to let the terminal
> emulator know where each prompt, and/or each command's output begins
> and ends. There's work going on for letting the terminal emulator
> recognize the prompts, but even if it's successful, it'll probably
> take 5-10 years to reach the majority of the users. And it probably
> still wouldn't solve the case of knowing the boundary between the two
> outputs if a "cat file1.txt; cat file2.txt" is executed, let alone if
> they're concatenated with "cat file1.txt file2.txt".

I think you are trying to find a perfect solution, and because it
probably doesn't exist, or at least is hard to come by, you conclude
that a solution that is imperfect should be rejected.

But I'm not saying my proposal is the perfect solution, just that it
is better (sometimes, way better) than the default of considering each
line a paragraph.

> So, what you're arguing for, is that the default behavior should be
> something that's:
> - currently not implementable in a semantically correct way (to stop
> around shell prompts) due to technical limitations, and
> - isn't what Unicode says.

The first point has to do with the search for a perfect solution.  My
advice is to settle for something reasonable even if it is not
perfect.

The second point is incorrect: the UBA explicitly allows the
implementation to apply higher-level protocols for paragraph
direction, see HL1 in UAX#9.

> You have not convinced me that the pros outweigh the cons.

There are no cons in my proposal that aren't already present in the
default each-line-is-a-new-paragraph rule.  So even if the pros don't
outweigh the cons, the balance should be better than under the default.

> That being said, I'm more than open to see such a behavior as a
> future extension, subject of course to the semantic prompt stuff
> being available.

I think the default should provide reasonably good display, and
each-line-is-a-new-paragraph doesn't.


Re: Encoding italic

2019-02-05 Thread wjgo_10...@btinternet.com via Unicode

James Kass wrote:

> William’s suggestion of floating a proposal for handling italics with
> VS14 might be an example of the old saying about “putting the cart
> before the horse”.


Well, a proposal just about using VS14 to indicate a request for an 
italic version of a glyph in plain text, including a suggestion of to 
which characters it could apply, would test whether such a proposal 
would be accepted to go into the Document Register for the Unicode 
Technical Committee to consider or just be deemed out of scope and 
rejected and not considered by the Unicode Technical Committee.


If the proposal were accepted into the Document Register of the Unicode
Technical Committee, other people could then submit comments and further
proposals, because it would have been established that such a topic is
deemed acceptable for the Document Register of the Unicode Technical
Committee.


William Overington
Tuesday 5 February 2019





Re: Encoding italic

2019-02-05 Thread James Kass via Unicode



William Overington wrote,

> Well, a proposal just about using VS14 to indicate a request for an
> italic version of a glyph in plain text, including a suggestion of to
> which characters it could apply, would test whether such a proposal
> would be accepted to go into the Document Register for the Unicode
> Technical Committee to consider or just be deemed out of scope and
> rejected and not considered by the Unicode Technical Committee.

As long as “italics in plain-text” is considered out-of-scope by 
Unicode, any proposal for handling italics in plain-text would probably 
be considered out-of-scope, as well.  But I could be wrong and wouldn’t 
mind seeing a proposal.




Re: Ancient Greek apostrophe marking elision

2019-02-05 Thread James Tauber via Unicode
On Tue, Feb 5, 2019 at 12:23 AM James Kass via Unicode 
wrote:

> Text a man has JOINED together, let not algorithm put asunder.
>

I was hoping so much that ὃ οὖν ὁ θεὸς συνέζευξεν ἄνθρωπος μὴ χωριζέτω
would have an apostrophe but alas no.