Re: Encoding italic

2019-02-04 Thread James Kass via Unicode



Philippe Verdy responded to William Overington,

> the proposal would contradict the goals of variation selectors and would
> pollute the variation sequences registry (possibly even creating
> conflicts).
> And if we admit it for italics, then another VSn will be dedicated to
> bold, and another for monospace, and finally many would follow for various
> style modifiers.
> Finally we would no longer have enough variation selectors for all
> requests.


There are 256 variation selector characters.  Any use of variation 
sequences not registered by Unicode would be non-conformant.
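
The figure of 256 can be checked against the character names that ship
with Python. This is only a sketch: it assumes the two code point ranges
below are the variation selector blocks (VS1 to VS16 and VS17 to VS256),
and the helper name variation_selectors is made up for the illustration.

```python
import unicodedata

# Assumed ranges of the two variation selector blocks.
VS_RANGES = [
    (0xFE00, 0xFE0F),    # VARIATION SELECTOR-1 .. VARIATION SELECTOR-16
    (0xE0100, 0xE01EF),  # VARIATION SELECTOR-17 .. VARIATION SELECTOR-256
]


def variation_selectors():
    """Collect code points whose name starts with 'VARIATION SELECTOR-'."""
    found = []
    for first, last in VS_RANGES:
        for cp in range(first, last + 1):
            name = unicodedata.name(chr(cp), "")
            if name.startswith("VARIATION SELECTOR-"):
                found.append(cp)
    return found


selectors = variation_selectors()
print(len(selectors))
print(unicodedata.name(chr(0xFE0D)))  # the character referred to as VS14
```

The count says nothing about conformance; which sequences are valid is
decided by the registered variation sequences, not by the selectors
themselves.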


William’s suggestion of floating a proposal for handling italics with 
VS14 might be an example of the old saying about “putting the cart 
before the horse”.  Any preliminary proposal would first have to clear 
the hurdle of the propriety of handling italic information at the 
plain-text level.  Such a proposal might list various approaches for 
accomplishing that, if that hurdle can be surmounted.




Re: Ancient Greek apostrophe marking elision

2019-02-04 Thread James Kass via Unicode



On 2019-01-28 8:58 PM, Richard Wordingham wrote:
> On Mon, 28 Jan 2019 03:48:52 +
> James Kass via Unicode  wrote:
>
>> It’s been said that the text segmentation rules seem over-complicated
>> and are probably non-trivial to implement properly.  I tried your
>> suggestion of WORD JOINER U+2060 after tau ( γένοιτ⁠’ ἄν ), but it
>> only added yet another word break in LibreOffice.
>
> I said we *don't* have a control that joins words.  The text of TUS
> used to say we had one in U+2060, but that was removed in 2015.  I
> pleaded for the retention of this functionality in document
> L2/2015/15-192, but my request was refused.  I pointed out in ICU
> ticket #11766 that ICU's Thai word breaker retained this facility. ...

Sorry for sounding obtuse there.  It was your *post* which suggested the 
use of WORD JOINER.  You did clearly assert that it would not work.  So, 
human nature, I /had/ to try it and see.


It. did. not. work.  (No surprise.)  But it /should/ have worked. It’s a 
JOINER, for goodness sake!


When the author/editor puts any kind of JOINER into a text string, 
what’s the intent?  What’s the poînt of having a JOINER that doesn’t?


Recently I put a ZWJ between the “c” and the “t” in the word 
“Respec‍tfully” as an experiment.  Spellchecker flagged both “respec” 
and “tfully” as being misspelt, which they probably are.  A ZWNJ would 
have been used if there had been any desire for the string to be *split* 
there, e.g., to forbid formation of a discretionary ligature.  Instead 
the ZWJ was inserted, signalling authorial intent that a ‘more joined’ 
form of the “c-t” substring was requested.
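
How the spellchecker tokenises is not known here, so the following is a
hypothetical illustration only. It shows one way a tokeniser could end
up with "Respec" and "tfully", and how ignoring format characters
(general category Cf, which covers ZWJ and ZWNJ) would keep the word
whole. Both function names are invented for the sketch.

```python
import re
import unicodedata

ZWJ = "\u200d"  # ZERO WIDTH JOINER
word = "Respec" + ZWJ + "tfully"


def naive_tokens(text):
    """Treat anything that is not an ASCII letter as a word break."""
    return re.findall(r"[A-Za-z]+", text)


def tokens_ignoring_format_chars(text):
    """Drop category Cf characters first, then tokenise the same way."""
    stripped = "".join(ch for ch in text if unicodedata.category(ch) != "Cf")
    return re.findall(r"[A-Za-z]+", stripped)


print(naive_tokens(word))
print(tokens_ignoring_format_chars(word))
```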


Text a man has JOINED together, let not algorithm put asunder.



Re: Proposal for BiDi in terminal emulators

2019-02-04 Thread Eli Zaretskii via Unicode
> Date: Mon, 4 Feb 2019 21:00:55 +
> From: Richard Wordingham via Unicode 
> 
> > The definition is trivial: the order of characters on
> > display, from left to right.  The only possible reason to split hairs
> > here could be when some characters don't appear on display, like
> > control characters.  Other than that, there should be no doubt what
> > visual order means.
> 
> To me, 'visual order' means in the dominant order of the script.

That is not the correct definition, IMO.

> Furthermore, let me quote from the Bidi Algorithm:
> 
> "In combination with the following rule, this means that trailing
> whitespace will appear at the visual end of the line (in the paragraph
> direction)."
> 
> The 'visual end' is clearly not always the right-hand end!

This talks about the "visual end", not about "visual order".


Re: Bidi paragraph direction in terminal emulators (was: Proposal for BiDi in terminal emulators)

2019-02-04 Thread Richard Wordingham via Unicode
On Mon, 4 Feb 2019 22:27:39 +0100
Egmont Koblinger via Unicode  wrote:

> Hi Richard,
> 
> > The concept appears to exist in the form of the fields of the
> > fifth edition of ECMA-48.  Have you digested this ambitious
> > standard?  
> 
> To be honest: No, I haven't. And I have no idea what those "fields"
> are.

(Taken out of order)

> That being said, I'd really, honestly love to see if someone evaluated
> ECMA's "fields" and created a feasibility study for current terminal
> emulators, similarly to how I did it with TR/53.

They mostly seem to be security, protection and checking features.
They seem to make sense for a captive system used as a till or for stock
look-up by customers.  For example, fields can be restricted as to how
they are overwritten, e.g. not at all, or only with numbers, and some
fields cannot be copied from the terminal.  HTML forms seem to provide
most of this functionality nowadays.

Fields are persistent attributes.

On reading further, the pane boundary functionality seems to be
provided by the 'line home position' and 'line limit position'.  These
would have to be re-established whenever a pane became the active pane,
but they seem to support the notion of writing a paragraph into a
pane, with the terminal sorting out the splitting into lines.  I'm not
sure that this would be portable between ECMA-48 terminals; I get
the impression that there would be a reliance on unstandardised
behaviour being appropriate.  I could be wrong; the specification may
be there.

> I spent (read: wasted) way too much time studying ECMA TR/53 to get to
> understand what it's talking about, to realize that the good parts
> were already obvious to me, and to be able to argue why I firmly
> believe that the bad parts are bad. Remember: These documents were
> created in 1991, that is, 28 years ago. (I'm emphasizing it because I
> did the math wrong for a long time, I thought it was 18 years ago :-D.)
> Things have changed a lot since then.

It took me a while to work out that the recommendations of ECMA TR/53
had been implemented in Issue 5 of ECMA-48.

> As for the BiDi docs, I found that the current state of the art,
> current best practices, existing BiDi algorithm differ so much from
> ECMA's approach (which no one I'm aware of cared to implement for 28
> years) that the standard is of pretty little use. Only a few good
> parts could be kept (but needed tiny corrections), and plenty of other
> things needed to be built up anew. This is the only reasonable way to
> move forward.

The relationship between the data store and the presentation store
doesn't seem to be very well defined.  There may be room for the BiDi
algorithm there.

> If you designed a house 2 or 3 years ago, and finally have the money
> to get it built, you can reasonably start building it. If you designed
> a house 28 years ago and finally have the chance to build it
> (including the exact same heating technologies, electrical system
> etc.), you wouldn't, would you? I'm sure you looked at those plans,
> and started at the very least heavily updating them, or started to
> design a brand new one, perhaps somewhat based on your old ideas.

But a scheme may be more persuasive if it can be said to conform to
ECMA-48.

One thing that is very unclear in ECMA-48 is how characters are
allocated to cells in 'implicit' mode.  As the Arabic encoding
considered contained harakat, it looks as though the allocation is
defined by 'unspecified protocols'. I note that in the scheme
apparently given most consideration, forced Arabic presentation forms
are selected by a combination of escape sequences and Arabic letters.
The 'unspecified protocols' could be interpreted as one grapheme
cluster* per group of cells.  The typical groups would be one cell and
the two cells for a CJK character.

*Grapheme cluster is a customisable concept.
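
The "one grapheme cluster per group of cells" reading can be sketched as
follows. This is an assumption-laden toy, not what ECMA-48 specifies: a
cluster is approximated as a base character plus any following
nonspacing or enclosing marks (categories Mn and Me), and a cluster gets
two cells when its base is East Asian Wide or Fullwidth. The name
cell_groups is invented for the example.

```python
import unicodedata


def cell_groups(text):
    """Return (cluster, cell_count) pairs for a string."""
    clusters = []
    for ch in text:
        # Attach combining marks (for example Arabic harakat) to the base.
        if clusters and unicodedata.category(ch) in ("Mn", "Me"):
            clusters[-1] += ch
        else:
            clusters.append(ch)

    groups = []
    for cluster in clusters:
        wide = unicodedata.east_asian_width(cluster[0]) in ("W", "F")
        groups.append((cluster, 2 if wide else 1))
    return groups


# Latin letter, CJK ideograph, Arabic BEH carrying a FATHA.
print(cell_groups("a\u4e2d\u0628\u064e"))
```

A real terminal would need the full grapheme cluster rules and its own
width tables; the point is only that marks share the cell of their base.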
 
> I don't expect it to be any different with "fields" of ECMA-48. I'm
> not aware of any terminal emulator implementing anything like them,
> whatever they are. Probably there's a good reason for that. Whatever
> purpose they aimed to serve apparently wasn't important enough for
> such a long time. By now, if they're found important, they should
> probably be solved by some new design (or at the very least, just like
> I did with TR/53, the work should begin by evaluating that standard to
> see if it's still feasible).

> Instead of spending a huge amount of work on my BiDi proposal, I could
> have just said: "guys, let's go with ECMA for BiDi handling". The
> thing is, I'm pretty sure it wouldn't have taken us anywhere. I don't
> expect it to be different with "fields" either.

Your interpretation document would have explored the issues.

> The starting point for my work was the current state of terminal
> emulators and the surrounding ecosystem, plus the current BiDi
> algorithm; not some ancient plan that was buried deep in some drawer
> for almost three decades. I hope this makes sense.

You're assuming that the 

Re: Bidi paragraph direction in terminal emulators (was: Proposal for BiDi in terminal emulators)

2019-02-04 Thread Egmont Koblinger via Unicode
Hi Eli,

> IME, this is a grave mistake.  I hope I explained why; it is now up to
> you to decide what to do about that.

Let me share one more thought.

I have to admit, I'm not an Emacs user, I only have some vague ideas
how powerful a tool it is. But in its very core I still believe it's a
text editor – is it fair to say this? It could be used for example to
conveniently create TUTORIAL.he.

I'm not aware of all the kinds of works you can do in Emacs, but I
have a feeling that the kind of work you do in a terminal emulator is
potentially more diverse. (Let's not nitpick that a terminal can run
emacs and emacs has a terminal inside so mathematically speaking it's
all the same...)

"cat TUTORIAL.he" is indeed one of the commands you can execute in a
terminal, and unfortunately, given what terminals currently understand
from their contents, I just cannot make it display as you would prefer
(and I agree would make a lot of sense). But it's just one use case.

There are plenty of line-oriented tools.

Think of "head" and "tail". They operate on lines of files, which end
up being paragraphs in the terminal according to my definition.
According to your definition, they could cut a paragraph in half, they
could render differently than as if the entire file was printed.
According to my definition, you'll always get the same visual
representation, just on the given fragment of the file.

Think of "grep", possibly combined with "-r" to process files
recursively, and "-C" to print context lines. Not only can it cut
paragraphs (of your definition) in half when it displays the matching
line (plus context), but also how would you locate in its output where
it switches from one match's context to the next match's context
within the same file, or to a match in another file? How would you
define a paragraph, and how would you define the bigger unit on which
the paragraph direction is guessed? I think it's again a use case
where my definition of paragraph is less problematic than yours.

Think of ad-hoc shell scripts that use "echo"/"printf" to inform the
user, "read" to read data etc. Or utilities written in C or whatever
that don't care about terminals at all, just print output. In these
cases there's no one formatting / wrapping at 80 columns performed by
the app. A logical segment is typically printed as a single line,
which will be wrapped by the terminal if it doesn't fit in the current
width (and in some terminals rewrapped when the terminal is resized),
this matches my definition of paragraph. There's rarely an empty line
injected in these cases; if there is, it is most likely to separate
some even bigger semantical units.

There are just sooo many use cases, it's impossible to perfectly
address all of them at once. "cat TUTORIAL.he" is just one of them,
not necessarily the most typical, not necessarily the one that should
drive the BiDi design.

Let's note that the four "BiDi-aware" terminals that I could test all
define paragraphs as lines – I mean visual lines on their own canvas.
If the terminal is 80 characters wide, and a utility prints a line of
100 characters, it'll obviously wrap into 80+20 characters. And then
these terminals treat them as two separate paragraphs, one with 80
characters and one with 20, and run BiDi separately on them. I'm
confident that my specification which says that it should be preserved
as a 100 character long paragraph and passed to BiDi accordingly is
already a significant step forward.
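
The difference between the two treatments can be sketched with paragraph
direction detection alone. The code below is a simplification: it picks
the base direction from the first strong character (in the spirit of
UAX#9 rules P2 and P3, but without handling isolates), and the function
name base_direction is made up for the example.

```python
import unicodedata


def base_direction(text, default="LTR"):
    """Guess paragraph direction from the first strong character."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "LTR"
        if bidi_class in ("R", "AL"):
            return "RTL"
    return default


WIDTH = 80
# A 100-character logical line: Latin text, then Hebrew after column 80.
line = "a" * 78 + "  " + "\u05d0" * 20

# Existing behaviour: every visual row is its own paragraph.
rows = [line[i:i + WIDTH] for i in range(0, len(line), WIDTH)]
print([base_direction(row) for row in rows])

# Proposed behaviour: the whole logical line is one paragraph.
print(base_direction(line))
```

With per-row paragraphs the second row is detected as RTL although it is
the tail of an LTR line, so the result changes with the terminal width.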


cheers,
egmont



Re: Bidi paragraph direction in terminal emulators (was: Proposal for BiDi in terminal emulators)

2019-02-04 Thread Egmont Koblinger via Unicode
Hi Eli,

> I think it's unreasonable and impractical to expect 'echo', 'cat', and
> its ilk to emit bidi controls (or any other controls) to force
> paragraph direction.  For starters, they won't know what direction to
> force, because they don't understand the text they are processing.

I agree, it is unreasonable for 'echo', 'cat' etc. to emit BiDi controls.

There could be some higher level helper utilities though, let's say a
"bidi-cat" that examines the file, makes a guess, emits the
corresponding escape sequences and cats the file. It's not necessarily
a good approach, but a possible one (at least temporarily until
terminals implement a better one).
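
A rough sketch of what such a "bidi-cat" might do is given below. The
guess is a plain first-strong-character heuristic over the whole file.
The escape sequences are an assumption: they follow the ECMA-48 SCP
(SELECT CHARACTER PATH) form CSI Ps SP k, with Ps = 1 for left-to-right
and Ps = 2 for right-to-left, and should be checked against the actual
specification before use. All names are invented for the example.

```python
import unicodedata

CSI = "\x1b["
# Assumed sequences (SCP, CSI Ps SP k); verify against the specification.
SCP_LTR = CSI + "1 k"
SCP_RTL = CSI + "2 k"


def guess_direction(text):
    """Return 'RTL' if the first strong character is R or AL, else 'LTR'."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "LTR"
        if bidi_class in ("R", "AL"):
            return "RTL"
    return "LTR"


def bidi_cat(text):
    """Prefix the text with the direction-setting sequence for its guess."""
    prefix = SCP_RTL if guess_direction(text) == "RTL" else SCP_LTR
    return prefix + text


print(repr(bidi_cat("\u05e9\u05dc\u05d5\u05dd\n")))
```

A single guess per file is the crudest possible policy; a file mixing
LTR and RTL paragraphs would need a guess per paragraph.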

On the other hand, it's not unreasonable for higher level stuff (e.g.
shell scripts, or tools like "zip") to use such control characters.

> No, this simple case must work reasonably well with the application
> _completely_ oblivious to the bidi aspects.  If this can't work
> reasonably well, I submit that the entire concept of having a
> bidi-aware terminal emulator doesn't "hold water".

There isn't a magic wand. I can't magically fix every BiDi issue by
changing the terminal emulator's source code. Not because I'm clumsy,
but because it just can't be done. If it was possible, I wouldn't have
written a long specification, I would have just done it. (Actually, if
it was possible, others would surely have done it long before I joined
terminal emulator development.)

There need to be multiple modes, some of them due to the technical
particularities of terminal emulation that aren't seen elsewhere (e.g.
explicit vs. implicit), and some of them because they are present
everywhere where it comes to BiDi (e.g. paragraph direction). And if
the mode is not set correctly, things might break, there's nothing new
in it.

What my specification essentially modifies is that with this
specification, you at least will have a chance to get the mode right.

Currently there are perhaps like 4 different behaviors implemented
across terminal emulators when it comes to BiDi. An application cannot
control and cannot query the behavior. In order to get Emacs to behave
properly, you have to ask your users to adjust a setting (and I cannot
repeat enough times that I find this an unacceptable user experience).
If the settings of the terminal aren't what Emacs expects, the result
could be broken (RTL words might even show up in reverse, LTR order).

The same goes for the random example of "zip -h", assuming that they
add Hebrew translation. Given the current set of popular terminal
emulators, there's no way zip could emit some Hebrew text in a
reliably readable way. Whatever it does, there will be terminal
emulators (and settings thereof) where the result is totally broken
(reversed), or at least unpleasant (wrong paragraph direction used).
Moreover, if "zip" emits the Hebrew text in the semantically correct
logical order (e.g. they use whatever existing framework, like gettext
and a popular .po editor), as opposed to the visual LTR order seen in
some legacy systems, it will need different terminal emulator settings
than Emacs, so if someone uses both zip and Emacs regularly, they'll
have to continuously toggle their terminal's settings back and forth –
have I mentioned how unacceptable I find this as a user? :)

One of the key points of my specification is that applications will be
able to automatically set the mode. Emacs will be able to switch to
the mode it requires, and so will be zip. They will have the
opportunity.

If they don't take this opportunity, it's not my problem, and
there's nothing I could do about it. Let's say hypothetically that zip
adds Hebrew translations, but refuses to emit the escape sequence that
switches to RTL paragraph direction, and thus its result doesn't look
perfect. Can terminal emulators, can my specification, can I be
blamed in this case? I don't think so. If zip knows exactly what it
wants to print (as with the help page it knows for sure), and is given
all the technical infrastructure to reliably achieve that, it'd be
solely them to blame if they refused to properly use it. It's
absolutely out of the scope of my work to try to fix this case.

"cat" is substantially different. In case of "zip", the creators of
that software know exactly what the output should look like, and
according to my specification (assuming a conforming terminal
emulator, of course) nothing stops them from achieving it. "cat"
doesn't know, cannot know the desired look, since the file itself
lacks this information.

Paragraph direction is a concept that sucks big time. (I have no idea
how Unicode could have got it better, though.) It's a piece of
information that needs to be carried externally along with the text,
in order to make sure it'll be displayed correctly. It's a pain in the
butt, just as much as carrying the encoding in the pre-Unicode days was,
which hardly anyone cared about, resulting in incorrect accented letters
way too often. Practically everyone's lazy and 

Re: Bidi paragraph direction in terminal emulators (was: Proposal for BiDi in terminal emulators)

2019-02-04 Thread Richard Wordingham via Unicode
On Tue, 5 Feb 2019 00:08:10 +0100
Egmont Koblinger via Unicode  wrote:

> Hi Eli,
> 
> > Actually, UAX#9 defines "paragraph" as the chunk of text delimited
> > by paragraph separator characters.  This means characters whose bidi
> > category is B, which includes Newline, the CR-LF pair on Windows,
> > U+0085 NEL, and U+2029 PARAGRAPH SEPARATOR.  

It actually gives two different definitions.  Table 4 of UAX#9 restricts
the type B to *appropriate* newline functions; not all newlines are
paragraph separators.

> Indeed, this was an oversight on my side. So, with this definition,
> every single newline character starts a new paragraph. The result of
> printf "Hello\nWorld\n" > world.txt
> is a text file consisting of two paragraphs, with 5 characters in
> each. Correct?

No, it depends on when a newline function is 'appropriate'.  TUS 5.8
Rule R2b applies - 'In simple text editors, interpret any NLF the same
as LS'.

> > Actually, Emacs implements the rule that paragraphs are separated by
> > empty lines.  This is documented in the Emacs manuals.  
> 
> That is, Emacs overrides UAX#9 and comes up with a different
> definition? Furthermore, you argue that in terminals I should follow
> Emacs's definition rather than Unicode's? Or please clarify if I
> misunderstood you here.

He's deriving 'B' from a protocol.

Richard.


Re: Bidi paragraph direction in terminal emulators (was: Proposal for BiDi in terminal emulators)

2019-02-04 Thread Egmont Koblinger via Unicode
Hi Eli,

> Actually, UAX#9 defines "paragraph" as the chunk of text delimited by
> paragraph separator characters.  This means characters whose bidi
> category is B, which includes Newline, the CR-LF pair on Windows,
> U+0085 NEL, and U+2029 PARAGRAPH SEPARATOR.

Indeed, this was an oversight on my side. So, with this definition,
every single newline character starts a new paragraph. The result of
printf "Hello\nWorld\n" > world.txt
is a text file consisting of two paragraphs, with 5 characters in each. Correct?
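
Under that literal reading, where every character of bidi class B ends a
paragraph, the split can be sketched like this. The function name
split_paragraphs is hypothetical, and treating CR LF as one separator is
a choice made for the example.

```python
import unicodedata


def split_paragraphs(text):
    """Split text at every character whose bidi class is B."""
    paragraphs, current = [], []
    i = 0
    while i < len(text):
        ch = text[i]
        if unicodedata.bidirectional(ch) == "B":
            paragraphs.append("".join(current))
            current = []
            # Count a CR LF pair as a single separator.
            if ch == "\r" and text[i + 1:i + 2] == "\n":
                i += 1
        else:
            current.append(ch)
        i += 1
    if current:
        paragraphs.append("".join(current))
    return paragraphs


# The same content that printf "Hello\nWorld\n" writes to world.txt.
paragraphs = split_paragraphs("Hello\nWorld\n")
print(paragraphs, [len(p) for p in paragraphs])
```

Whether a given newline really counts as a paragraph separator is a
matter for the protocol in use; the sketch simply takes the strictest
interpretation.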

> Actually, Emacs implements the rule that paragraphs are separated by
> empty lines.  This is documented in the Emacs manuals.

That is, Emacs overrides UAX#9 and comes up with a different
definition? Furthermore, you argue that in terminals I should follow
Emacs's definition rather than Unicode's? Or please clarify if I
misunderstood you here.

> > while Emacs itself is a viewer that treats runs between single
> > newlines as paragraphs. That is, Emacs is inconsistent with itself.
>
> Incorrect.  Emacs always treats a run of text between empty lines as a
> single paragraph, in TUTORIAL.he and everywhere else.  There's nothing
> special about TUTORIAL.he, it is just a plain text file with a few
> dozen of bidi formatting controls, needed to show the key sequences
> with weak and neutral characters in correct visual order.  [...]

Thanks for the clarification, I believe it's clear to me now.

> At least with Emacs, it is not the same.  I think considering each
> line as a separate paragraph makes writing bidi plain-text documents
> that look right almost impossible, if each line ends in a newline [...]

> My personal recommendation is to adopt the empty line rule.  It's
> simple enough and gives good results IME. [...]

> I'm surprised that you describe this as such a complex problem.  I
> think you explained up-thread that terminal emulators should cope with
> lines of text arriving piecemeal, which I interpreted as meaning that
> text is stored in the emulator's memory.  Modern emulators running on
> windowed desktops also provide scroll-back buffers, and react to
> expose events.  So I think the text that is currently in the viewport,
> and also some text previously shown, are stored in memory, and can be
> consulted.

The problem is not the memory management.

Let's look at the following session:

---snip---
prompt$ cat file1.txt
This is the
first human-perceived paragraph.

And this is the
second.
prompt$ cat file2.txt
Here this is the
third paragraph.

And this one is
the fourth.
prompt$
---snip---

If you load the files to Emacs, it is perfectly aware of the contents
of the two files. It can define paragraphs however it wants to, and
BiDi the files accordingly.

The terminal emulator doesn't know what's a shell prompt, what's a
command that the user types, what's the output of that command. (You
don't know either from this snippet. Maybe I only cat'ed file1.txt,
and "prompt$ cat file2.txt" is just the sixth line of this eleven-line
file.)

In the terminal emulator's eyes, with Emacs's definition (empty line
delimited), this is one paragraph:

prompt$ cat file1.txt
This is the
first human-perceived paragraph.

and this is another paragraph:

And this is the
second.
prompt$ cat file2.txt
Here this is the
third paragraph.

and similarly for the third one.

I believe I understand your concerns with the per-line paragraph
definition, but this interpretation that I've just shown most likely
leads to even more broken behavior.
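
The grouping described above can be reproduced mechanically. The sketch
below applies the empty-line rule to the screen contents exactly as the
terminal sees them; the function name empty_line_paragraphs is invented
for the illustration.

```python
session = """prompt$ cat file1.txt
This is the
first human-perceived paragraph.

And this is the
second.
prompt$ cat file2.txt
Here this is the
third paragraph.

And this one is
the fourth.
prompt$"""


def empty_line_paragraphs(screen_text):
    """Group lines into paragraphs delimited by empty lines."""
    paragraphs, current = [], []
    for line in screen_text.split("\n"):
        if line.strip() == "":
            if current:
                paragraphs.append(current)
                current = []
        else:
            current.append(line)
    if current:
        paragraphs.append(current)
    return paragraphs


for number, paragraph in enumerate(empty_line_paragraphs(session), start=1):
    print(number, paragraph)
```

The second group contains the end of file1.txt, the next prompt and the
beginning of file2.txt, which is the mixing the text warns about.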

It's a really nontrivial technical problem to let the terminal
emulator know where each prompt, and/or each command's output begins
and ends. There's work going on for letting the terminal emulator
recognize the prompts, but even if it's successful, it'll probably
take 5-10 years to reach the majority of the users. And it probably
still wouldn't solve the case of knowing the boundary between the two
outputs if a "cat file1.txt; cat file2.txt" is executed, let alone if
they're concatenated with "cat file1.txt file2.txt".

So, what you're arguing for, is that the default behavior should be
something that's:
- currently not implementable in a semantically correct way (to stop
around shell prompts) due to technical limitations, and
- isn't what Unicode says.

You have not convinced me that the pros outweigh the cons. That being
said, I'm more than open to see such a behavior as a future extension,
subject of course to the semantic prompt stuff being available.


cheers,
egmont


Re: Proposal for BiDi in terminal emulators

2019-02-04 Thread Egmont Koblinger via Unicode
Hi,

> To me, 'visual order' means in the dominant order of the script.

This is not a definition I've come across anywhere else, nor does it
match my intuition of "visual order": the exact visual order (recursive
definition, yay!) of how you see the glyphs being displayed in the
row.

> So,
> if one takes it as natural that a decimal number starts with the most
> significant digits, the decimal numbers used with Arabic are *not*
> stored in visual order if considered as part of that script.

The visual order is: You get the string rendered properly. You scan
with your eyes in one strict direction, and take note of what you see
in that order.

For example, let's say: "Hello Shalom" (the latter word in Hebrew):

HELLO שָׁלוֹם

The logical order:
H
E
L
L
O
space
שָׁ
ל
וֹ
ם

The visual order, from left to right is:
H
E
L
L
O
space
ם
וֹ
ל
שָׁ

Similarly, the visual order from right to left (a much more rarely
seen concept, the exact reverse of the visual LTR order) is:
שָׁ
ל
וֹ
ם
space
O
L
L
E
H

"Visual order" most of the time means "visual left to right order",
although strictly speaking, "visual right to left order" is just as
much a visual order. This is all independent from the script's
dominant order.
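
The three listings can be reproduced with a toy reordering. This is not
the Unicode Bidirectional Algorithm: it assumes an LTR paragraph, keeps
each base letter together with its nonspacing marks (bidi class NSM),
and simply reverses every maximal run of RTL clusters, with no
resolution of neutrals or numbers. The function names are hypothetical.

```python
import unicodedata


def clusters(text):
    """Group each base character with the NSM marks that follow it."""
    out = []
    for ch in text:
        if out and unicodedata.bidirectional(ch) == "NSM":
            out[-1] += ch
        else:
            out.append(ch)
    return out


def visual_ltr(text):
    """Toy visual left-to-right order for an LTR paragraph."""
    result, rtl_run = [], []
    for cluster in clusters(text):
        if unicodedata.bidirectional(cluster[0]) in ("R", "AL"):
            rtl_run.append(cluster)
        else:
            result.extend(reversed(rtl_run))
            rtl_run = []
            result.append(cluster)
    result.extend(reversed(rtl_run))
    return result


# "HELLO " followed by shin+qamats+shin dot, lamed, vav+holam, final mem.
text = "HELLO \u05e9\u05b8\u05c1\u05dc\u05d5\u05b9\u05dd"
logical = clusters(text)
ltr = visual_ltr(text)
rtl = list(reversed(ltr))  # visual right-to-left is the exact reverse
print(logical)
print(ltr)
print(rtl)
```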

> "In combination with the following rule, this means that trailing
> whitespace will appear at the visual end of the line (in the paragraph
> direction)."
>
> The 'visual end' is clearly not always the right-hand end!

Yes, that's right. (And it doesn't contradict the definition of
"visual order". For RTL paragraphs, those trailing whitespaces appear
at the beginning of the "visual LTR order").


e.



Re: Bidi paragraph direction in terminal emulators (was: Proposal for BiDi in terminal emulators)

2019-02-04 Thread Richard Wordingham via Unicode
On Mon, 04 Feb 2019 22:39:07 +0200
Eli Zaretskii via Unicode  wrote:

> > Date: Mon, 4 Feb 2019 19:45:13 +
> > From: Richard Wordingham via Unicode 
> > 
> > Yes.  If one has a text composed of LTR and RTL paragraphs, one has
> > to choose how far apart their starting margins are.  I think that
> > could get complicated for plain text if the terminal has unbounded
> > width.  
> 
> But no real-life terminal does.  The width is always bounded.

The Emacs terminal (M-x term) seems to be a reasonable approximation,
with the scroll-left and scroll-right commands changing the margins'
separations.  This is an example of a terminal that has lines with
left-to-right character paths and lines with right-to-left
character paths.  (Such lines are necessarily separated by blank
lines.)  Geometrically, column positions on left-to-right and
right-to-left character paths are incomparable - resizing the window
and scrolling move them differently.

Richard.


Re: Proposal for BiDi in terminal emulators

2019-02-04 Thread Asmus Freytag via Unicode

  
  
On 2/4/2019 1:00 PM, Richard Wordingham via Unicode wrote:

> To me, 'visual order' means in the dominant order of the script.

Visual order is a term of art, meaning the characters are ordered
in memory in the same order as they are displayed on the screen.

Whether that progresses from left to right or right to left would
then depend on the display algorithm. When screen display
corresponded to actual buffers in memory, those tended to be
organized left-to-right, with lowest address at the top left.

The contrasting term is "logical order" which (largely)
corresponds to the order in which characters are typed or spoken.

Logical order text needs to get rearranged during display
whenever it does not correspond to visual order.

A./

  



Re: Bidi paragraph direction in terminal emulators (was: Proposal for BiDi in terminal emulators)

2019-02-04 Thread Egmont Koblinger via Unicode
> > Yes.  If one has a text composed of LTR and RTL paragraphs, one has to
> > choose how far apart their starting margins are.  I think that could
> > get complicated for plain text if the terminal has unbounded width.
>
> But no real-life terminal does.  The width is always bounded.

Allegedly the no longer maintained FinalTerm, and maybe another one or
two not so popular terminal emulators experimented with this.

VTE and a few other emulators have also received such a feature
request; VTE has rejected it. See
https://bugzilla.gnome.org/show_bug.cgi?id=769440 if you're curious.

Indeed BiDi becomes problematic in the sense that Richard pointed out:
how far should the starting margins be from each other? By terminal
emulators rejecting the idea of unbounded width, this is not a problem
for them.

It might still be a problem for BiDi aware text viewers/editors,
though. I mean one possible, obvious approach could be to adjust them
according to the terminal's width. Another is to take it from the
file's contents (e.g. longest line). But maybe there's demand for
other options, e.g. to have those margins 80 characters away from each
other even when the file is viewed on a mobile phone where the
viewport is narrower and the user wishes to scroll horizontally. This
is up for text viewers/editors to decide.


cheers,
egmont


Re: Bidi paragraph direction in terminal emulators (was: Proposal for BiDi in terminal emulators)

2019-02-04 Thread Egmont Koblinger via Unicode
Hi Richard,

> That split is wrong if you want the non-HTML text to lay out reasonably
> well in anything but a higher order protocol forcing RTL.  You need to
> split it as:
>
> lorem ipsum ABC
> <[ DEF foobar

Okay, so you should use LRMs or other similar tricks when wrapping a
human-perceived paragraph of text.

I take it as:

- The expected definition of "paragraph", for the technical sake of
running the BiDi algorithm, is lines of the text file (that is,
between a newline and the next one).

- On top of this technical definition, the document is crafted so that
lines are not longer than a certain threshold, and the human-perceived
paragraphs are usually delimited by empty lines (sometimes by other
means, like bullets of a list).

Sounds like a reasonable approach to me, probably the best to have.
And, by the way, it aligns with my BiDi proposal if the higher level
protocol (escape sequences) sets the paragraph direction correctly and
disables autodetection.


cheers,
egmont


Re: Bidi paragraph direction in terminal emulators (was: Proposal for BiDi in terminal emulators)

2019-02-04 Thread Egmont Koblinger via Unicode
Hi Richard,

> The concept appears to exist in the form of the fields of the
> fifth edition of ECMA-48.  Have you digested this ambitious standard?

To be honest: No, I haven't. And I have no idea what those "fields" are.

I spent (read: wasted) way too much time studying ECMA TR/53 to get to
understand what it's talking about, to realize that the good parts
were already obvious to me, and to be able to argue why I firmly
believe that the bad parts are bad. Remember: These documents were
created in 1991, that is, 28 years ago. (I'm emphasizing it because I
did the math wrong for a long time, I thought it was 18 years ago :-D.)
Things have changed a lot since then.

As for the BiDi docs, I found that the current state of the art,
current best practices, existing BiDi algorithm differ so much from
ECMA's approach (which no one I'm aware of cared to implement for 28
years) that the standard is of pretty little use. Only a few good
parts could be kept (but needed tiny corrections), and plenty of other
things needed to be built up anew. This is the only reasonable way to
move forward.

If you designed a house 2 or 3 years ago, and finally have the money
to get it built, you can reasonably start building it. If you designed
a house 28 years ago and finally have the chance to build it
(including the exact same heating technologies, electrical system
etc.), you wouldn't, would you? I'm sure you looked at those plans,
and started at the very least heavily updating them, or started to
design a brand new one, perhaps somewhat based on your old ideas.

I don't expect it to be any different with "fields" of ECMA-48. I'm
not aware of any terminal emulator implementing anything like them,
whatever they are. Probably there's a good reason for that. Whatever
purpose they aimed to serve apparently wasn't important enough for
such a long time. By now, if they're found important, they should
probably be solved by some new design (or at the very least, just like
I did with TR/53, the work should begin by evaluating that standard to
see if it's still feasible).

Instead of spending a huge amount of work on my BiDi proposal, I could
have just said: "guys, let's go with ECMA for BiDi handling". The
thing is, I'm pretty sure it wouldn't have taken us anywhere. I don't
expect it to be different with "fields" either.

The starting point for my work was the current state of terminal
emulators and the surrounding ecosystem, plus the current BiDi
algorithm; not some ancient plan that was buried deep in some drawer
for almost three decades. I hope this makes sense.

That being said, I'd really, honestly love to see if someone evaluated
ECMA's "fields" and created a feasibility study for current terminal
emulators, similarly to how I did it with TR/53.


cheers,
egmont


Re: Proposal for BiDi in terminal emulators

2019-02-04 Thread Richard Wordingham via Unicode
On Sun, 3 Feb 2019 20:50:03 +
Richard Wordingham via Unicode  wrote:

> On Sun, 03 Feb 2019 20:07:51 +0200
> Eli Zaretskii via Unicode  wrote:
 
> Which is why I try to remember to issue the emacs command 'M-x shell'
> and issue grep commands from the buffer created thereby.  The
> point I'm making is that this emacs command hasn't made terminal
> emulators obsolete, even though it also does graphics.

I now discover that 'M-x term' brings up an Emacs terminal emulator.
That gives grep's output the colouring appropriate for a terminal.  The
cell widths vary from line-to-line.
 
Richard.



Re: Does "endian-ness" apply to UTF-8 characters that use multiple bytes?

2019-02-04 Thread Doug Ewell via Unicode
http://www.unicode.org/faq/utf_bom.html#utf8-2 
  
--
Doug Ewell | Thornton, CO, US | ewellic.org
 



Re: Proposal for BiDi in terminal emulators

2019-02-04 Thread Richard Wordingham via Unicode
On Sun, 03 Feb 2019 18:03:37 +0200
Eli Zaretskii via Unicode  wrote:

> > Date: Sun, 3 Feb 2019 03:02:13 +0100
> > Cc: unicode@unicode.org
> > From: Egmont Koblinger via Unicode 
> >   
> > > All I am saying is that your proposal should define what it means
> > > by visual order.  
> > 
> > Are you nitpicking on me not giving a precise definition on the
> > otherwise IMO freaking obvious "visual order"  
> 
> Most probably.  The definition is trivial: the order of characters on
> display, from left to right.  The only possible reason to split hairs
> here could be when some characters don't appear on display, like
> control characters.  Other than that, there should be no doubt what
> visual order means.

To me, 'visual order' means in the dominant order of the script.  So,
if one takes it as natural that a decimal number starts with the most
significant digits, the decimal numbers used with Arabic are *not*
stored in visual order if considered as part of that script.

Furthermore, let me quote from the Bidi Algorithm:

"In combination with the following rule, this means that trailing
whitespace will appear at the visual end of the line (in the paragraph
direction)."

The 'visual end' is clearly not always the right-hand end!

Richard.


Re: Bidi paragraph direction in terminal emulators (was: Proposal for BiDi in terminal emulators)

2019-02-04 Thread Eli Zaretskii via Unicode
> Date: Mon, 4 Feb 2019 19:45:13 +
> From: Richard Wordingham via Unicode 
> 
> Yes.  If one has a text composed of LTR and RTL paragraphs, one has to
> choose how far apart their starting margins are.  I think that could
> get complicated for plain text if the terminal has unbounded width.

But no real-life terminal does.  The width is always bounded.


Re: Does "endian-ness" apply to UTF-8 characters that use multiple bytes?

2019-02-04 Thread James Tauber via Unicode
Endian-ness only affects ordering of bytes within a code unit.

Because UTF-8 has single byte code units, the order is not affected by
endian-ness, only the UTF-8 bit mapping itself.

Note also that endian-ness only affects individual 16-bit code units in
UTF-16. If you have a surrogate pair, endian-ness doesn't affect the
ordering of each 16-bit unit that makes up the pair, only the two bytes
within each of the units.
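
A small sketch may make the distinction concrete. It assumes Python 3
and its built-in codecs; the helper name hex_dump and the choice of
U+1F600 as a supplementary-plane example are only illustrative. It
prints the bytes of é and of one surrogate-pair character under each
encoding form:

```python
def hex_dump(text, encoding):
    """Return the encoded bytes of `text` as space-separated hex pairs."""
    return " ".join(f"{byte:02X}" for byte in text.encode(encoding))


e_acute = "\u00e9"      # é, U+00E9
emoji = "\U0001F600"    # U+1F600, needs a surrogate pair in UTF-16

# UTF-8: code units are single bytes, so there is only one byte order.
print(hex_dump(e_acute, "utf-8"))      # C3 A9
print(hex_dump(emoji, "utf-8"))        # F0 9F 98 80

# UTF-16: the two bytes inside each 16-bit code unit are swapped,
# but the high surrogate still comes before the low surrogate.
print(hex_dump(e_acute, "utf-16-be"))  # 00 E9
print(hex_dump(e_acute, "utf-16-le"))  # E9 00
print(hex_dump(emoji, "utf-16-be"))    # D8 3D DE 00
print(hex_dump(emoji, "utf-16-le"))    # 3D D8 00 DE

# UTF-32: one 32-bit code unit per character, four bytes reordered.
print(hex_dump(e_acute, "utf-32-be"))  # 00 00 00 E9
print(hex_dump(e_acute, "utf-32-le"))  # E9 00 00 00
```

In other words, a hex editor shows C3 A9 for é in any UTF-8 file,
whatever the byte order of the machine that wrote it.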

James



On Mon, Feb 4, 2019 at 2:25 PM Costello, Roger L. via Unicode <
unicode@unicode.org> wrote:

> Hello Unicode Experts!
>
> As I understand it, endian-ness applies to multi-byte words.
>
> Endian-ness does not apply to ASCII characters because each character is a
> single byte.
>
> Endian-ness does apply to UTF-16BE (Big-Endian), UTF-16LE (Little-Endian),
> UTF-32BE and UTF32-LE because each character uses multiple bytes.
>
> Clearly endian-ness does not apply to single-byte UTF-8 characters. But
> what about UTF-8 characters that use multiple bytes, such as the character
> é, which uses two bytes C3 and A9; does endian-ness apply? For example, if
> a file is in Little Endian would the character é appear in a hex editor as
> A9 C3 whereas if the file is in Big Endian the character é would appear in
> a hex editor as C3 A9?
>
> /Roger
>
>

-- 
*James Tauber*
Eldarion | jktauber.com (Greek Linguistics) | Modelling Music | Digital Tolkien


Re: Does "endian-ness" apply to UTF-8 characters that use multiple bytes?

2019-02-04 Thread Clive Hohberger via Unicode
Asmus,
I believe it also applies to the bit order in the bytes.
I believe UTF-16 and UTF-32 are transmitted as single 16 or 32-bit numbers.
UTF-8 is a stream of 8-bit numbers.

Clive

*Clive P. Hohberger, PhD MBA*
Managing Director
Clive Hohberger, LLC
+1 847 910 8794
cp...@case.edu

*Inventor of the Ultracode Bar Code Symbology*
*2017 Label Industry Global Award for Innovation*


On Mon, Feb 4, 2019 at 1:29 PM Asmus Freytag via Unicode <
unicode@unicode.org> wrote:

> On 2/4/2019 11:21 AM, Costello, Roger L. via Unicode wrote:
>
> Hello Unicode Experts!
>
> As I understand it, endian-ness applies to multi-byte words.
>
> Endian-ness does not apply to ASCII characters because each character is a 
> single byte.
>
> Endian-ness does apply to UTF-16BE (Big-Endian), UTF-16LE (Little-Endian), 
> UTF-32BE and UTF32-LE because each character uses multiple bytes.
>
> Clearly endian-ness does not apply to single-byte UTF-8 characters. But what 
> about UTF-8 characters that use multiple bytes, such as the character é, 
> which uses two bytes C3 and A9; does endian-ness apply? For example, if a 
> file is in Little Endian would the character é appear in a hex editor as A9 
> C3 whereas if the file is in Big Endian the character é would appear in a hex 
> editor as C3 A9?
>
> /Roger
>
>
>
> UTF-8 is a byte stream. Therefore, the order of bytes in a multiple byte
> integer does not come into it.
>
> A./
>


Re: Bidi paragraph direction in terminal emulators (was: Proposal for BiDi in terminal emulators)

2019-02-04 Thread Richard Wordingham via Unicode
On Mon, 04 Feb 2019 18:53:22 +0200
Eli Zaretskii via Unicode  wrote:

> Date: Mon, 4 Feb 2019 01:19:21 +
> From: Richard Wordingham via Unicode 

>> If you look at it in Notepad, all
>> lines will be LTR or all lines will be RTL.  
 
> That's because Notepad implements _only_ the higher-level protocol for
> base paragraph direction: there's no way to make Notepad determine the
> direction by looking at the text.

Yes.  If one has a text composed of LTR and RTL paragraphs, one has to
choose how far apart their starting margins are.  I think that could
get complicated for plain text if the terminal has unbounded width.

Richard.


Re: Does "endian-ness" apply to UTF-8 characters that use multiple bytes?

2019-02-04 Thread Asmus Freytag via Unicode

  
  
On 2/4/2019 11:21 AM, Costello, Roger L. via Unicode wrote:

> Hello Unicode Experts!
>
> As I understand it, endian-ness applies to multi-byte words.
>
> Endian-ness does not apply to ASCII characters because each character is a
> single byte.
>
> Endian-ness does apply to UTF-16BE (Big-Endian), UTF-16LE (Little-Endian),
> UTF-32BE and UTF32-LE because each character uses multiple bytes.
>
> Clearly endian-ness does not apply to single-byte UTF-8 characters. But what
> about UTF-8 characters that use multiple bytes, such as the character é,
> which uses two bytes C3 and A9; does endian-ness apply? For example, if a
> file is in Little Endian would the character é appear in a hex editor as A9
> C3 whereas if the file is in Big Endian the character é would appear in a hex
> editor as C3 A9?
>
> /Roger

UTF-8 is a byte stream. Therefore, the order of bytes in a multiple byte
integer does not come into it.

A./

  



Does "endian-ness" apply to UTF-8 characters that use multiple bytes?

2019-02-04 Thread Costello, Roger L. via Unicode
Hello Unicode Experts!

As I understand it, endian-ness applies to multi-byte words.

Endian-ness does not apply to ASCII characters because each character is a 
single byte.

Endian-ness does apply to UTF-16BE (Big-Endian), UTF-16LE (Little-Endian), 
UTF-32BE and UTF32-LE because each character uses multiple bytes. 

Clearly endian-ness does not apply to single-byte UTF-8 characters. But what 
about UTF-8 characters that use multiple bytes, such as the character é, which 
uses two bytes C3 and A9; does endian-ness apply? For example, if a file is in 
Little Endian would the character é appear in a hex editor as A9 C3 whereas if 
the file is in Big Endian the character é would appear in a hex editor as C3 A9?

/Roger



Re: Bidi paragraph direction in terminal emulators (was: Proposal for BiDi in terminal emulators)

2019-02-04 Thread Eli Zaretskii via Unicode
> Date: Mon, 4 Feb 2019 01:19:21 +
> From: Richard Wordingham via Unicode 
> 
> On Sun, 03 Feb 2019 19:50:50 +0200
> Eli Zaretskii via Unicode  wrote:
> 
> > Do you see how this is carefully formatted to avoid overflowing an
> > 80-column line of a typical terminal?  Now suppose this is translated
> > into a RTL language, which causes the Copyright line to start with a
> > strong R letter (because "Copyright" is translated).  You will see the
> > first line flushed to the right margin, then the next line flushed to
> > the left margin (because it's a separate paragraph, and starts with a
> > strong L letter).  Then the line which says "The default action..."
> > will again start at the right.  And so on and so forth -- the result
> > is extremely ugly.
> 
> Depending on the environment.  If you look at it in Notepad, all lines
> will be LTR or all lines will be RTL.

That's because Notepad implements _only_ the higher-level protocol for
base paragraph direction: there's no way to make Notepad determine the
direction by looking at the text.

> Would not a careful translator either ensure that each non-blank
> line had a strong character and that all first strong characters
> were (a) L, (b) R or (c) AL?

This is very hard in practice, and is a tremendous annoyance when
translating message catalogs to RTL languages.  Translation is a hard
enough job even without this complication.


Re: Bidi paragraph direction in terminal emulators (was: Proposal for BiDi in terminal emulators)

2019-02-04 Thread Eli Zaretskii via Unicode
> From: Egmont Koblinger 
> Date: Mon, 4 Feb 2019 00:36:23 +0100
> Cc: unicode@unicode.org
> 
> The Unicode BiDi algorithm states that it operates on paragraphs of
> text, and leaves it up to a higher protocol to define what a paragraph
> exactly is.
> 
> What's the definition of "paragraph" in the context of plain text files?
> 
> I don't think there's a single well-established practice.

Actually, UAX#9 defines "paragraph" as the chunk of text delimited by
paragraph separator characters.  This means characters whose bidi
category is B, which includes Newline, the CR-LF pair on Windows,
U+0085 NEL, and U+2029 PARAGRAPH SEPARATOR.
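
As a rough illustration of that definition, the following Python sketch
splits a string wherever a character of bidi class B occurs. It relies
only on the standard unicodedata module; the function name
split_paragraphs is made up for this example, and treating CR-LF as a
single separator is an assumption of the sketch:

```python
import unicodedata


def split_paragraphs(text):
    """Split `text` at characters whose bidi class is B (paragraph separator)."""
    paragraphs = []
    current = []
    i = 0
    while i < len(text):
        ch = text[i]
        if unicodedata.bidirectional(ch) == "B":
            paragraphs.append("".join(current))
            current = []
            # Treat a CR-LF pair as one separator, not two.
            if ch == "\r" and text[i + 1:i + 2] == "\n":
                i += 1
        else:
            current.append(ch)
        i += 1
    if current:
        paragraphs.append("".join(current))
    return paragraphs


sample = "one\r\ntwo\u2029three\u0085four\nfive"
print(split_paragraphs(sample))  # ['one', 'two', 'three', 'four', 'five']
```

Note that U+2028 LINE SEPARATOR has bidi class WS, not B, so it does
not start a new paragraph under this definition.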

> In some, e.g. in Emacs's TUTORIAL.he, or markdown files, it's way
> more complicated, probably there isn't a well-defined grammar for
> how exactly bullet list entries and alike should become new
> paragraphs.

Actually, Emacs implements the rule that paragraphs are separated by
empty lines.  This is documented in the Emacs manuals.  (That's by
default, users and Lisp programs can control that to some extent.)
This rule is global, and applied to any file or buffer, including
TUTORIAL.he.
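
To show why the choice of paragraph boundary matters, here is a sketch
in Python that compares the two conventions discussed in this thread.
It is not Emacs's code: the function names are invented, the
first-strong-character test is a simplified form of UAX#9 rules P2 and
P3 that ignores isolates, and "empty line" is taken to mean a line
containing at most spaces and tabs:

```python
import re
import unicodedata


def base_direction(paragraph):
    """Simplified P2/P3: direction of the first strong character, else LTR."""
    for ch in paragraph:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "LTR"
        if bidi_class in ("R", "AL"):
            return "RTL"
    return "LTR"


def paragraphs_by_blank_line(text):
    """Paragraphs are runs of text separated by one or more empty lines."""
    chunks = re.split(r"\n(?:[ \t]*\n)+", text.strip("\n"))
    return [chunk for chunk in chunks if chunk.strip()]


def paragraphs_by_newline(text):
    """Every non-empty line is its own paragraph."""
    return [line for line in text.split("\n") if line.strip()]


# A Hebrew word followed by Latin text, then a line starting with Latin.
sample = "\u05e9\u05dc\u05d5\u05dd world\nsecond line\n\nhello \u05e9\u05dc\u05d5\u05dd\n"

print([base_direction(p) for p in paragraphs_by_blank_line(sample)])
# ['RTL', 'LTR']
print([base_direction(p) for p in paragraphs_by_newline(sample)])
# ['RTL', 'LTR', 'LTR']
```

With the blank-line rule the first two lines share one RTL paragraph;
with the per-newline rule "second line" becomes its own LTR paragraph
and would be flushed to the opposite margin.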

> lorem ipsum FED ]> CBA foobar
> 
> The visual representation, in a narrower viewport, might wrap for
> example like this:
> 
> lorem ipsum CBA
> FED ]> foobar

I suggest leaving line wrapping alone for the moment: it is a further
complication.  Let's first talk about text whose every line ends in a
hard newline -- this is what you see in most "simple" text-mode
utilities which we are talking about.  If/when we solve the problems
there, we can then look at the issues with wrapping.

> Here comes the twist. Let's view this latter file with a viewer that
> uses a _different_ definition for paragraph. Let's view it in Gedit,
> Emacs, or the work-in-progress BiDi-aware VTE by "cat"ing it, where
> every newline begins a new paragraph – that's how these viewers define
> the notion of "paragraph" for the sake of BiDi.
> 
> The visual layout in these viewers becomes:
> 
> lorem ipsum CBA
> <[ FED foobar
> 
> which is just not correct. Since here BiDi is run on the two lines
> separately, the initial "<[" is treated as LTR, placed at the wrong
> location in the wrong order, and the glyphs aren't mirrored.

This kind of problem happens all the time, and you cannot avoid it.
Different programs display bidi text differently.  I propose not to
try to solve this problem, because IME it cannot be solved in general.
Let's focus on the terminal emulators that should comply with your
guidelines, and let's try to decide what should they do about base
paragraph direction of text emitted by "simple" text utilities.
If they all make decisions by the same rule, they all will show the
same text identically.

> Now, Emacs ships a TUTORIAL.he which, for most of its contents (but
> not everywhere) seems to treat runs between empty lines as paragraphs,

Correct.

> while Emacs itself is a viewer that treats runs between single
> newlines as paragraphs. That is, Emacs is inconsistent with itself.

Incorrect.  Emacs always treats a run of text between empty lines as a
single paragraph, in TUTORIAL.he and everywhere else.  There's nothing
special about TUTORIAL.he, it is just a plain text file with a few
dozen bidi formatting controls, needed to show the key sequences
with weak and neutral characters in correct visual order.  (Some of
those controls can probably be removed nowadays, since we now have the
BPA of Unicode 6.3 -- the file was written before Unicode 6.3 was
released.)  In fact, I wrote that tutorial as an exercise, to prove to
myself that Emacs can be useful for editing non-trivial bidi text.

> In case you think I got something wrong with Emacs: Could you please
> give exact definitions:
> - What are the exact units (so-called "paragraphs" by UAX9) that it
> runs BiDi on when it loads and displays a file?

See above: for the purpose of the Emacs UBA implementation, paragraphs
are separated by empty lines.  That is the only rule in Emacs
regarding paragraph determination.

> - What are the exact units (so-called "paragraphs" by UAX9) in
> TUTORIAL.he on which BiDi needs to be run in order to get the desired
> readable version?

The same.  There's nothing special about that file.

> What most likely happens is that in order to see a difference, you'd
> need to have more special symbols, or at least a more special
> constellation of them. Probably TUTORIAL.he is just luckily simple
> enough that such a difference isn't hit.

No, TUTORIAL.he is neither "lucky" nor "simple".  I deliberately used
there almost every bidi formatting control there is, where
appropriate, to make sure this stuff works as intended in an otherwise
plain text file.

> Another possibility is (and I cannot check because I can't speak
> Hebrew) that somewhere TUTORIAL.he "cheats" with the logical order to
> get the desired visual one.

There's no cheating there, I assure you.

> This definition of paragraph (stuff between a newline and 

Re: Proposal for BiDi in terminal emulators

2019-02-04 Thread Eli Zaretskii via Unicode
> Date: Mon, 04 Feb 2019 05:25:43 +0200
> Cc: unicode@unicode.org
> From: Eli Zaretskii via Unicode 
> 
> Try customizing scroll-conservatively, it sounds like you want that.

Ignore me: I misunderstood what you were looking for.  You are right:
Emacs doesn't support such scrolling method.