
Re: Proposal for BiDi in terminal emulators

2019-02-04 Thread Eli Zaretskii via Unicode

> Date: Mon, 4 Feb 2019 21:00:55 +
> From: Richard Wordingham via Unicode 
> 
> > The definition is trivial: the order of characters on
> > display, from left to right.  The only possible reason to split hairs
> > here could be when some characters don't appear on display, like
> > control characters.  Other than that, there should be no doubt what
> > visual order means.
> 
> To me, 'visual order' means in the dominant order of the script.

That is not the correct definition, IMO.

> Furthermore, let me quote from the Bidi Algorithm:
> 
> "In combination with the following rule, this means that trailing
> whitespace will appear at the visual end of the line (in the paragraph
> direction)."
> 
> The 'visual end' is clearly not always the right-hand end!

This talks about the "visual end", not about "visual order".

Re: Proposal for BiDi in terminal emulators

2019-02-04 Thread Egmont Koblinger via Unicode

Hi,

> To me, 'visual order' means in the dominant order of the script.

This is not a definition I've come across anywhere else, nor does it
match my intuition of "visual order": the exact visual order (recursive
definition, yay!) in which you see the glyphs displayed in the
row.

> So,
> if one takes it as natural that a decimal number starts with the most
> significant digits, the decimal numbers used with Arabic are *not*
> stored in visual order if considered as part of that script.

The visual order is: You get the string rendered properly. You scan
with your eyes in one strict direction, and take note of what you see
in that order.

For example, let's say: "Hello Shalom" (the latter word in Hebrew):

HELLO שָׁלוֹם

The logical order:
H
E
L
L
O
space
שָׁ
ל
וֹ
ם

The visual order, from left to right is:
H
E
L
L
O
space
ם
וֹ
ל
שָׁ

Similarly, the visual order from right to left (a much more rarely
seen concept, the exact reverse of the visual LTR order) is:
שָׁ
ל
וֹ
ם
space
O
L
L
E
H

"Visual order" most of the time means "visual left to right order",
although strictly speaking, "visual right to left order" is just as
much a visual order. This is all independent from the script's
dominant order.
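
The distinction above can be illustrated with a minimal Python sketch.
It is a toy, not the Unicode Bidirectional Algorithm: it assumes an LTR
paragraph, treats only characters of bidi class R or AL as
right-to-left, and ignores numbers, neutrals between RTL words, and
explicit controls. The helper names `clusters` and `visual_ltr` are
made up for this illustration.

```python
import unicodedata


def clusters(text):
    """Group each base character with the non-spacing marks that follow it."""
    out = []
    for ch in text:
        if out and unicodedata.bidirectional(ch) == "NSM":
            out[-1] += ch
        else:
            out.append(ch)
    return out


def visual_ltr(text):
    """Toy reordering for an LTR paragraph: reverse each run of RTL clusters."""
    result, run = [], []
    for cluster in clusters(text):
        if unicodedata.bidirectional(cluster[0]) in ("R", "AL"):
            run.append(cluster)
        else:
            result.extend(reversed(run))
            run = []
            result.append(cluster)
    result.extend(reversed(run))
    return result


# "HELLO " followed by the pointed Hebrew word, stored in logical order.
logical = "HELLO \u05e9\u05c1\u05b8\u05dc\u05d5\u05b9\u05dd"
ltr_view = visual_ltr(logical)   # what the eye meets scanning left to right
rtl_view = ltr_view[::-1]        # the same row scanned right to left
```

Both `ltr_view` and `rtl_view` describe the same rendered row; only the
scanning direction differs, which is the point made above.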

> "In combination with the following rule, this means that trailing
> whitespace will appear at the visual end of the line (in the paragraph
> direction)."
>
> The 'visual end' is clearly not always the right-hand end!

Yes, that's right. (And it doesn't contradict the definition of
"visual order". For RTL paragraphs, those trailing whitespaces appear
at the beginning of the "visual LTR order").


e.

Re: Proposal for BiDi in terminal emulators

2019-02-04 Thread Asmus Freytag via Unicode


  
  
On 2/4/2019 1:00 PM, Richard Wordingham via Unicode wrote:

> To me, 'visual order' means in the dominant order of the script.

Visual order is a term of art, meaning the characters are ordered
in memory in the same order as they are displayed on the screen.

Whether that progresses from left to right or right to left would
then depend on the display algorithm. When screen display
corresponded to actual buffers in memory, those tended to be
organized left-to-right, with lowest address at the top left.

The contrasting term is "logical order" which (largely)
corresponds to the order in which characters are typed or spoken.

Logical order text needs to get rearranged during display
whenever it does not correspond to visual order.

A./

Re: Proposal for BiDi in terminal emulators

2019-02-04 Thread Richard Wordingham via Unicode

On Sun, 3 Feb 2019 20:50:03 +
Richard Wordingham via Unicode  wrote:

> On Sun, 03 Feb 2019 20:07:51 +0200
> Eli Zaretskii via Unicode  wrote:

> Which is why I try to remember to issue the emacs command 'M-x shell'
> and issue grep commands from the buffer created thereby.  The
> point I'm making is that this emacs command hasn't made terminal
> emulators obsolete, even though it also does graphics.

I now discover that 'M-x term' brings up an Emacs terminal emulator.
That gives grep's output the colouring appropriate for a terminal.  The
cell widths vary from line-to-line.

Richard.

Re: Proposal for BiDi in terminal emulators

2019-02-04 Thread Richard Wordingham via Unicode

On Sun, 03 Feb 2019 18:03:37 +0200
Eli Zaretskii via Unicode  wrote:

> > Date: Sun, 3 Feb 2019 03:02:13 +0100
> > Cc: unicode@unicode.org
> > From: Egmont Koblinger via Unicode 
> >   
> > > All I am saying is that your proposal should define what it means
> > > by visual order.  
> > 
> > Are you nitpicking on me not giving a precise definition on the
> > otherwise IMO freaking obvious "visual order"  
> 
> Most probably.  The definition is trivial: the order of characters on
> display, from left to right.  The only possible reason to split hairs
> here could be when some characters don't appear on display, like
> control characters.  Other than that, there should be no doubt what
> visual order means.

To me, 'visual order' means in the dominant order of the script.  So,
if one takes it as natural that a decimal number starts with the most
significant digits, the decimal numbers used with Arabic are *not*
stored in visual order if considered as part of that script.

Furthermore, let me quote from the Bidi Algorithm:

"In combination with the following rule, this means that trailing
whitespace will appear at the visual end of the line (in the paragraph
direction)."

The 'visual end' is clearly not always the right-hand end!

Richard.

Re: Proposal for BiDi in terminal emulators

2019-02-04 Thread Eli Zaretskii via Unicode

> Date: Mon, 04 Feb 2019 05:25:43 +0200
> Cc: unicode@unicode.org
> From: Eli Zaretskii via Unicode 
> 
> Try customizing scroll-conservatively, it sounds like you want that.

Ignore me: I misunderstood what you were looking for.  You are right:
Emacs doesn't support such a scrolling method.

Re: Proposal for BiDi in terminal emulators

2019-02-03 Thread Eli Zaretskii via Unicode

> Date: Sun, 3 Feb 2019 20:35:18 +
> From: Richard Wordingham via Unicode 
> 
> > What is "screen overwriting" in this context?
> 
> When instead of adding lines to the bottom, new lines are added on top
> of and replace existing lines.  I prefer the scrollable terminal
> behaviour to the teletype behaviour of Emacs when running the
> Linux(?) monitor program 'top', but being a fuddy duddy I prefer the
> teletype behaviour of Emacs for 'man'.  From an error message from
> 'info', it seems that the Emacs buffer is classified as a 'dumb'
> terminal.

Try customizing scroll-conservatively, it sounds like you want that.

Re: Proposal for BiDi in terminal emulators

2019-02-03 Thread Richard Wordingham via Unicode

On Sun, 03 Feb 2019 18:13:06 +0200
Eli Zaretskii via Unicode  wrote:

> Actually, you pass the characters to be shaped in logical order, and
> then display the produced grapheme clusters in visual order.

Some early systems supporting computerised Hebrew script did pass
characters in left-to-right order.  This works fairly well when the
contents of character cells do not interact.

Richard.

Re: Proposal for BiDi in terminal emulators

2019-02-03 Thread Richard Wordingham via Unicode

On Sun, 03 Feb 2019 20:07:51 +0200
Eli Zaretskii via Unicode  wrote:

> > Date: Sun, 3 Feb 2019 17:45:06 +
> > From: Richard Wordingham via Unicode 
> >   
> > > > So, what do you recommend I run grep from for Hebrew or Tai
> > > > Lue?
> > > 
> > > Inside Emacs, of course: "M-x grep RET" etc.  
> > 
> > That assumes you like using bindings for all the commands; I
> > don't.  
> 
> What bindings?  "M-x grep" just shows the Grep hits in a separate
> window, you don't need to do anything except reading them.
> 
> The advantage is that you get bidi reordering and text shaping for
> free, something you won't get from most terminals.

Which is why I try to remember to issue the emacs command 'M-x shell'
and issue grep commands from the buffer created thereby.  The
point I'm making is that this emacs command hasn't made terminal
emulators obsolete, even though it also does graphics.

Richard.

Re: Proposal for BiDi in terminal emulators

2019-02-03 Thread Richard Wordingham via Unicode

On Sun, 03 Feb 2019 18:05:49 +0200
Eli Zaretskii via Unicode  wrote:

> > Date: Sat, 2 Feb 2019 21:49:40 +
> > From: Richard Wordingham via Unicode 
> > 
> > Eli will probably tell me I'm behind the times, but there are a few
> > places where a Gnome-terminal is better than an Emacs GUI window.
> > One is colour highlighting of text found by grep.  
> 
> ??? The Emacs 'grep' command also highlights the matches, by
> interpreting the escape sequences emitted by Grep the program it
> invokes.
> 
> > Another is that screen overwriting doesn't work in an Emacs
> > window.  
> 
> What is "screen overwriting" in this context?

When instead of adding lines to the bottom, new lines are added on top
of and replace existing lines.  I prefer the scrollable terminal
behaviour to the teletype behaviour of Emacs when running the
Linux(?) monitor program 'top', but being a fuddy duddy I prefer the
teletype behaviour of Emacs for 'man'.  From an error message from
'info', it seems that the Emacs buffer is classified as a 'dumb'
terminal.

Richard.

Re: Proposal for BiDi in terminal emulators

2019-02-03 Thread Eli Zaretskii via Unicode

> Date: Sun, 3 Feb 2019 17:45:06 +
> From: Richard Wordingham via Unicode 
> 
> > > So, what do you recommend I run grep from for Hebrew or Tai Lue?  
> > 
> > Inside Emacs, of course: "M-x grep RET" etc.
> 
> That assumes you like using bindings for all the commands; I don't.

What bindings?  "M-x grep" just shows the Grep hits in a separate
window, you don't need to do anything except reading them.

The advantage is that you get bidi reordering and text shaping for
free, something you won't get from most terminals.

> Command recall and having completion options serve me very well.  Your
> suggestion comes unstuck when I attempt to switch between the window's
> keyboard and the MULE keyboard in the middle of the command.  'M-x'
> isn't recursive.

This isn't an Emacs forum, so I will leave it at that; but you are
wrong on all counts.

Re: Proposal for BiDi in terminal emulators

2019-02-03 Thread Richard Wordingham via Unicode

On Sun, 03 Feb 2019 18:14:53 +0200
Eli Zaretskii via Unicode  wrote:

> > Date: Sun, 3 Feb 2019 02:43:06 +
> > Cc: Kent Karlsson 
> > From: Richard Wordingham via Unicode 
> > 
> > So, what do you recommend I run grep from for Hebrew or Tai Lue?  
> 
> Inside Emacs, of course: "M-x grep RET" etc.

That assumes you like using bindings for all the commands; I don't.
Command recall and having completion options serve me very well.  Your
suggestion comes unstuck when I attempt to switch between the window's
keyboard and the MULE keyboard in the middle of the command.  'M-x'
isn't recursive. Still, your suggestion should be useful for grepping
for ASCII stuff.

Richard.

Re: Proposal for BiDi in terminal emulators

2019-02-03 Thread Eli Zaretskii via Unicode

> Date: Sun, 3 Feb 2019 02:43:06 +
> Cc: Kent Karlsson 
> From: Richard Wordingham via Unicode 
> 
> So, what do you recommend I run grep from for Hebrew or Tai Lue?

Inside Emacs, of course: "M-x grep RET" etc.

Re: Proposal for BiDi in terminal emulators

2019-02-03 Thread Eli Zaretskii via Unicode

> Date: Sun, 3 Feb 2019 01:30:26 +
> From: Richard Wordingham via Unicode 
> 
> Shaping for RTL scripts happens on strings stored in logical order.
> These are then laid out right to left, though the dominant usage of
> the term 'advance width' for right-to-left glyph sequences feels
> perversely different from the use for left to right glyph sequences.
> 
> Passing text in the form of characters in left-to-right order is an
> annoying distraction, presumably forced on you by the attempt to
> maximise compatibility with existing systems.

Actually, you pass the characters to be shaped in logical order, and
then display the produced grapheme clusters in visual order.

Re: Proposal for BiDi in terminal emulators

2019-02-03 Thread Eli Zaretskii via Unicode

> Date: Sat, 2 Feb 2019 23:02:10 +0100
> Cc: unicode@unicode.org
> From: Egmont Koblinger via Unicode 
> 
> On top of this, I make the clarification that combining marks need to
> be reordered to be sent out to the terminal emulator _after_ their
> base letter

That is true in general regarding any text shaping: the shaping engine
needs the characters to be submitted in the logical order.  When Emacs
works on a text-mode terminal, it sends characters to be shaped
together, such as base character and its combining marks, in logical
order, even when the surrounding text is reordered into visual order.

> What I add is another mode (the technically less problematic
> "implicit" mode where the terminal displays the contents just as any
> BiDi-aware graphical text editor, browser etc. would do) for the
> sake of "cat"-like simple utilities

I think there are hard problems even for such "simple" utilities, and
I will start a separate thread about this.

Re: Proposal for BiDi in terminal emulators

2019-02-03 Thread Eli Zaretskii via Unicode

> Date: Sat, 2 Feb 2019 21:49:40 +
> From: Richard Wordingham via Unicode 
> 
> Eli will probably tell me I'm behind the times, but there are a few
> places where a Gnome-terminal is better than an Emacs GUI window.  One
> is colour highlighting of text found by grep.

??? The Emacs 'grep' command also highlights the matches, by
interpreting the escape sequences emitted by Grep the program it
invokes.

> Another is that screen overwriting doesn't work in an Emacs window.

What is "screen overwriting" in this context?

Re: Proposal for BiDi in terminal emulators

2019-02-03 Thread Eli Zaretskii via Unicode

> Date: Sun, 3 Feb 2019 03:02:13 +0100
> Cc: unicode@unicode.org
> From: Egmont Koblinger via Unicode 
> 
> > All I am saying is that your proposal should define what it means by
> > visual order.
> 
> Are you nitpicking on me not giving a precise definition on the
> otherwise IMO freaking obvious "visual order"

Most probably.  The definition is trivial: the order of characters on
display, from left to right.  The only possible reason to split hairs
here could be when some characters don't appear on display, like
control characters.  Other than that, there should be no doubt what
visual order means.

Re: Proposal for BiDi in terminal emulators

2019-02-02 Thread Richard Wordingham via Unicode

On Sun, 03 Feb 2019 02:01:18 +0100
Kent Karlsson via Unicode  wrote:

> On 2019-02-02 16:12, "Richard Wordingham via Unicode" wrote:

> > Doesn't Jerusalem in biblical Hebrew sometime have 3 marks below the
> > lamedh?  The depth then is the maximum depth, not the sum of the
> > depths.   
> 
> Do you want to view/edit such texts on a terminal emulator? (Rather
> than a GUI window.)
>  
> > Tai Lue has 'mai sat 3 lem' - that's three marks above for a
> > combination common enough to have a name.

> I don't question that as such. But again, do you want to view/edit
> such texts on a **terminal emulator**?

Oddly, I feel happier running bash on Gnome-terminal than an emacs
shell process.  What GUI window?  Perhaps I'm spoilt by some of the
features like colour.  Maybe I'd be happier if I could work out how to
get bash's emacs mode to work when running under emacs.  I'd be grepping
such material rather than viewing it. Moreover, I may be looking
through a lot of files rather than viewing a single one.

> It is just that such things are likely to graphically overflow the
> "cell" boundaries, unless the cells are disproportionately high (i.e.
> double or so line spacing). Doesn't really sound like a terminal
> emulator... I do not think terminal emulators should be used for
> ALL kinds of text.

I don't need fixed-width cells.  But otherwise, there are uses for both
terminal emulators and teletype emulators.

Different scripts (and languages within a script for that matter) merit
different cell aspect ratios.

So, what do you recommend I run grep from for Hebrew or Tai Lue?

Richard.

Re: Proposal for BiDi in terminal emulators

2019-02-02 Thread Egmont Koblinger via Unicode

Hi Richard,

On Sun, Feb 3, 2019 at 2:32 AM Richard Wordingham via Unicode
 wrote:

> That first reference doesn't even use the word 'visual'.

The Unicode BiDi algorithm does speak about "visual positions for
display", "reordering for display" etc.

> All I am saying is that your proposal should define what it means by
> visual order.

Are you nitpicking on me not giving a precise definition on the
otherwise IMO freaking obvious "visual order", or am I missing
something fundamental?

> Shaping for RTL scripts happens on strings stored in logical order.

That's what I recommend in my current v0.1, which was vetoed by you
guys, claiming that the terminal emulator should do it even in cases
when it's only aware of the visual order.

> Passing text in the form of characters in left-to-right order is an
> annoying distraction, presumably forced on you by the attempt to
> maximise compatibility with existing systems.

Nope; passing text in visual order(*) is a technical necessity for
Emacs (as Eli confirmed it) and all other fullscreen apps (text
editors and such), as I provide a detailed proof for that in my
proposal. It's literally impossible to perform visual cropping on a
string (required by practically all fullscreen text editors), visual
concatenation of strings (e.g. a line of tmux which has two panes next
to each other), and at the same time preserve the logical order that's
passed on. You just can't define a logical order after visual
operations.

(*) To be pedantic, they could pass the text in whatever order they
want to, with random cursor movements in between. The point is that
the terminal emulator won't reshuffle the cells, that is, they should
write into column 1 whichever they want to appear at the leftmost
position, into column 2 whichever they want to appear in column 2, and
so on. And unless the cursor is moved explicitly, the cursor keeps
moving forward to higher numbered columns, that is, the terminal
expects to receive visual order.
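
The footnote can be pictured with a toy model of one terminal row,
sketched in Python. The class `Row` and its methods are invented for
this illustration and do not correspond to any emulator's code; the
model ignores combining marks, wide characters and line wrapping.

```python
class Row:
    """One terminal row: a fixed number of cells plus a cursor column."""

    def __init__(self, width):
        self.cells = [" "] * width
        self.cursor = 0  # zero-based here; the text above counts from 1

    def move_to(self, column):
        """Explicit cursor movement requested by the application."""
        self.cursor = column

    def write(self, text):
        """Store each character in the current cell and advance rightwards."""
        for ch in text:
            if self.cursor >= len(self.cells):
                break  # this toy simply drops overflow
            self.cells[self.cursor] = ch
            self.cursor += 1

    def render(self):
        return "".join(self.cells)


# Same screen contents, reached by two different write orders.
in_order = Row(4)
in_order.write("ABCD")

jumping = Row(4)
jumping.move_to(2)
jumping.write("CD")
jumping.move_to(0)
jumping.write("AB")
```

Both rows render identically: the emulator never reshuffles cells, so
only the column each character lands in matters, and without cursor
movement that column simply increases.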

> Casting text into grids of 'characters' requires consideration of all
> types of writing elements.  The division into panes is an awkward
> complication; panes in the application not shared with the terminal is
> even worse for shaping.

I'm really not sure what you're trying to say here.

The feeling I get, and I'm happy if you can prove me wrong, is that
while you're truly knowledgeable about shaping, you haven't yet
understood the very fundamentals why terminals are vastly different
from let's say web browsers, which results in the technical necessity
of often relying on visual order. There's even a separate section
dedicated to explaining this in my spec. If terminals weren't vastly
different, BiDi there would've been solved along with the birth of the
Unicode BiDi algorithm, I wouldn't have spent months working on this
proposal, and we wouldn't be having this discussion right now :)

Remember, this whole story is about finding a compromise between what
a terminal emulator is, and what BiDi scripts require (incl. shaping).
If you want to do BiDi and shaping without compromises, you should get
away from terminal emulators (as Kent has also suggested). Having a
strict grid of characters is such a compromise. The terminal emulator
not being aware of the entire logical string, only the currently
onscreen bits (that is, a cropped version of the string), which
results in the need for the explicit mode (visual order) is another
such compromise.
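
Why a cropped row has no usable logical order can be shown with a small
Python sketch. It uses a toy reordering (an LTR paragraph, with only
bidi classes R and AL reversed), not the full Unicode algorithm, and
the name `visual_ltr` is made up for this illustration.

```python
import unicodedata


def visual_ltr(logical):
    """Toy reordering for an LTR paragraph: reverse each run of RTL letters."""
    out, run = [], []
    for ch in logical:
        if unicodedata.bidirectional(ch) in ("R", "AL"):
            run.append(ch)
        else:
            out.extend(reversed(run))
            run = []
            out.append(ch)
    out.extend(reversed(run))
    return "".join(out)


# "AB" followed by alef, bet, gimel, dalet in logical order.
logical = "AB\u05d0\u05d1\u05d2\u05d3"
visual = visual_ltr(logical)   # A B dalet gimel bet alef
cropped = visual[:4]           # a 4-column pane shows: A B dalet gimel

# The logical characters that survive the crop, in logical order.
visible = "".join(ch for ch in logical if ch in cropped)
is_contiguous = visible in logical
```

Here `visible` is "AB" plus the last two Hebrew letters, which is not a
contiguous piece of the logical string, so `is_contiguous` is false.
An application that crops like this has nothing sensible to send except
the visual order.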

cheers,
egmont

Re: Proposal for BiDi in terminal emulators

2019-02-02 Thread Richard Wordingham via Unicode

On Sat, 2 Feb 2019 23:02:10 +0100
Egmont Koblinger via Unicode  wrote:

> Hi Richard,
> 
> On Sat, Feb 2, 2019 at 9:57 PM Richard Wordingham
>  wrote:
> 
> > Seriously, you need to give a definition of 'visual order' for this
> > context.  Not everyone shares your chiralist view.  
> 
> When I look at the Unicode BiDi algorithm, or go to an online demo at
> https://unicode.org/cldr/utility/bidic.jsp, or look at the FriBidi API
> etc., their very basic functionality is that I pass the logical order
> (as the string is expected to be stored in text files etc.), and the
> result of the algorithm is the visual order.

That first reference doesn't even use the word 'visual'.  When I look
in Standard Annex 9, 'Unicode Bidirectional Algorithm', I find, 'In
combination with the following rule, this means that trailing
whitespace will appear at the visual end of the line (in the paragraph
direction)'.  Paragraph direction, of course, can be left-to-right or
right-to-left.  Your best hope there is, 'No bidirectional formatting.
This implies that the system does not visually interpret characters
from right-to-left scripts.'  It's a shame that that statement is not
true; one could build a system using N'ko decimal digits that only
visually interpreted characters from right-to-left scripts.

> What else do I need to further specify in the concept of "visual
> order"?

All I am saying is that your proposal should define what it means by
visual order.

> This is the low level issue I'm trying to address, to make sure that
> letters of words are always shown in the correct order. There's no way
> you could do shaping underneath this level, it makes no sense to talk
> about shaping, zero-width (non)joining, special Khmer symbols and
> whatnot on reversed words, right?

> The order of the letters needs to be
> fixed first, which is what I'm doing, and then all the bells and
> whistles needed for shaping might come on top of this.

Shaping for RTL scripts happens on strings stored in logical order.
These are then laid out right to left, though the dominant usage of
the term 'advance width' for right-to-left glyph sequences feels
perversely different from the use for left to right glyph sequences.

Passing text in the form of characters in left-to-right order is an
annoying distraction, presumably forced on you by the attempt to
maximise compatibility with existing systems.

Casting text into grids of 'characters' requires consideration of all
types of writing elements.  The division into panes is an awkward
complication; panes in the application not shared with the terminal is
even worse for shaping.

Richard.

Re: Proposal for BiDi in terminal emulators

2019-02-02 Thread Kent Karlsson via Unicode



On 2019-02-02 16:12, "Richard Wordingham via Unicode" wrote:

> On Sat, 02 Feb 2019 14:01:46 +0100
> Kent Karlsson via Unicode  wrote:
> 
>> Well, I guess you may need to put some (practical) limit to the number
>> of non-spacing marks (like max two above + max one below; overstrikes
>> are an edge case). Otherwise one may need to either increase the line
>> height (bad idea for a terminal emulator I think) or the marks start
>> to visually interfere with text on other lines (even with the hinted
>> limits there may be some interference), also a bad idea for a terminal
>> emulator. So I'm not so sure that non-spacing marks is a piece of
>> cake... (I.e., need to limit them.)
> 
> Doesn't Jerusalem in biblical Hebrew sometime have 3 marks below the
> lamedh?  The depth then is the maximum depth, not the sum of the
> depths. 

Do you want to view/edit such texts on a terminal emulator? (Rather
than a GUI window.)
 
> Tai Lue has 'mai sat 3 lem' - that's three marks above for a
> combination common enough to have a name.  Throw in the repetition mark
> and that's four marks above if you treat the subscript consonant as a
> mark (or code it to comply with the USE's erroneous grammar).

I don't question that as such. But again, do you want to view/edit such
texts on a **terminal emulator**?

It is just that such things are likely to graphically overflow the
"cell" boundaries, unless the cells are disproportionately high (i.e.
double or so line spacing). Doesn't really sound like a terminal
emulator... I do not think terminal emulators should be used for
ALL kinds of text.

/Kent K

Re: Proposal for BiDi in terminal emulators

2019-02-02 Thread Benjamin Riefenstahl via Unicode

Hi Richard,

> Benjamin Riefenstahl wrote:
>> the severe limitations of that environment.

Richard Wordingham writes:
> Eli will probably tell me I'm behind the times, but there are a few
> places where a Gnome-terminal is better than an Emacs GUI window.  One
> is colour highlighting of text found by grep.  Another is that screen
> overwriting doesn't work in an Emacs window.

I have not followed all of this thread, but is that on-topic?  Anyway I
did not mean to talk about Emacs GUI windows, they are a completely
different animal from terminal windows in my mind.  Where Emacs GUI
windows lack features in their interaction with other programs, people
who care about that are implementing those features.  There is no theory
or research necessary, beyond understanding the existing codebase.

>> Additional character forms could be added, where the Unicode
>> repertoire is not sufficient.  This could use PUA characters

> You do not need PUA. For U+0756 ARABIC LETTER BEH WITH SMALL V, we
> can form:
>
> Initial form:   200C 0756 200D
> Medial form:    200D 0756 200D
> Final form:     200D 0756 200C
> Isolated form:  200C 0756 200C
>
> The tricky bit is to get the terminal to accept them as cell contents.

If you want to implement in the terminal that it should interpret these
sequences, you can just as well implement shaping as a whole,
i.e. interpret any sequence that needs shaping.  There is no reason for
control characters here, I think.
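
The quoted joiner sequences can be generated mechanically. A small
Python sketch, where `positional_forms` is a hypothetical helper name;
whether a given terminal accepts such sequences as cell contents is
exactly the open question raised in the quote.

```python
ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER
ZWJ = "\u200d"   # ZERO WIDTH JOINER


def positional_forms(letter):
    """Wrap a joining letter in ZWJ/ZWNJ to request each positional form."""
    return {
        "initial": ZWNJ + letter + ZWJ,    # no join before, join after
        "medial": ZWJ + letter + ZWJ,      # join on both sides
        "final": ZWJ + letter + ZWNJ,      # join before, no join after
        "isolated": ZWNJ + letter + ZWNJ,  # no join on either side
    }


def as_hex(sequence):
    """Show a sequence as space-separated code points, as in the quote."""
    return " ".join(f"{ord(ch):04X}" for ch in sequence)


forms = positional_forms("\u0756")  # ARABIC LETTER BEH WITH SMALL V
table = {name: as_hex(seq) for name, seq in forms.items()}
```

The resulting `table` reproduces the four code point sequences listed
in the quote.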

I was looking at it from the standpoint of what works now, sending
presentation forms to the terminal, and what could then be a simple means
to extend that mechanism to support more shaping variants.  PUA
characters could work without changes in the terminal emulators
themselves.  You would only need the font that supports those PUA
characters, which is easy if you start from a Truetype font that already
supports that script and thus presumably already has that glyph.  From
my POV that is a very simple technique.

benny

Re: Proposal for BiDi in terminal emulators

2019-02-02 Thread Egmont Koblinger via Unicode

Hi Richard,

> My main interest in this, though, is in improving the general run of
> Indic terminal cell editors.  If we can get Gnome-terminal working for
> Kharoshthi, things should improve for LTR Indic.  Even working on the
> false assumption that Indic scripts are like Devanagari would be an
> improvement, despite my comments about Khmer.

So, as for concrete bugs, there's the aforementioned VTE bug 584160.
You might want to give the pending patches a try, or (to keep the
relevant discussion at one place) comment over there about your
desired priorities etc.

We've also set up a "Terminal WG" on freedesktop
(https://gitlab.freedesktop.org/terminal-wg), a place intended for
specifications. If you/we feel like certain bits around
Devanagari/Khmer/etc. handling need a proper specification before we
could jump to the implementation, probably that would be the best
platform to discuss that. Reason being that I don't know when I'd be
able to address them, if ever, but there are multiple terminal
emulator developers waiting there for such challenges. Also, IMHO a
bugtracker is a better forum than a mailing list if parties can't all
immediately work on the problem :)

I'm definitely aiming to fix the basic Devanagari rendering (that is:
spacing marks), for this autumn's VTE release. Maybe even for this
spring's. I probably won't do more (like Virama), they'll have to wait
for the HarfBuzz port.


cheers,
egmont

Re: Proposal for BiDi in terminal emulators

2019-02-02 Thread Egmont Koblinger via Unicode

Hi Richard,

On Sat, Feb 2, 2019 at 9:57 PM Richard Wordingham
 wrote:

> Seriously, you need to give a definition of 'visual order' for this
> context.  Not everyone shares your chiralist view.

When I look at the Unicode BiDi algorithm, or go to an online demo at
https://unicode.org/cldr/utility/bidic.jsp, or look at the FriBidi API
etc., their very basic functionality is that I pass the logical order
(as the string is expected to be stored in text files etc.), and the
result of the algorithm is the visual order.

On top of this, I make the clarification that combining marks need to
be reordered to be sent out to the terminal emulator _after_ their
base letter, because that's how terminal emulators work. The BiDi
problem area can only be reasonably addressed in the display layer, by
leaving the emulation layer pretty much unchanged. I find it
unreasonable to introduce a new mode where the combining accents are
sent to the terminal emulator _before_ their base letter. (On an
offtopic note, I wish that was the only mode in Unicode, it would
simplify a couple of things in the handling of streams. But this ship
has sailed decades ago.)

This reordering for the combining accents to come after (that is: to
the right) of the base letter in the LTR visual order is what e.g.
FriBidi does by default, due to the REORDER_NSM flag being set by
default.
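
The effect of that mark reordering can be sketched in plain Python. The
sketch does not call FriBidi; `reverse_codepoints` and `reorder_marks`
are names made up here, and the code handles only a single RTL run
whose marks have bidi class NSM.

```python
import unicodedata


def reverse_codepoints(rtl_run):
    """Plain reversal of an RTL run: marks end up to the left of their base."""
    return rtl_run[::-1]


def reorder_marks(visual):
    """Move each group of non-spacing marks to just after the base it belongs to."""
    out, pending = [], []
    for ch in visual:
        if unicodedata.bidirectional(ch) == "NSM":
            pending.append(ch)  # still waiting for the base on the right
        else:
            out.append(ch)
            out.extend(reversed(pending))  # restore the marks' logical order
            pending = []
    out.extend(reversed(pending))  # stray marks with no base stay at the end
    return "".join(out)


# Shin with two points, then lamed, in logical order.
logical = "\u05e9\u05c1\u05b8\u05dc"
naive = reverse_codepoints(logical)   # lamed, then both marks, then shin
terminal_order = reorder_marks(naive)  # lamed, shin, then shin's marks
```

After the fix-up every mark again follows its base in the LTR visual
stream, which is the shape a terminal emulator expects.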

Essentially, the "explicit mode" that my specification introduces is
the exact same behavior that most terminal emulators do now, and the
one that e.g. Emacs requires. They lay out the codepoints they
receive, from left to right. Nothing is going to change there. What I
add is another mode (the technically less problematic "implicit" mode
where the terminal displays the contents just as any BiDi-aware
graphical text editor, browser etc. would do) for the sake of
"cat"-like simple utilities, while being unsuitable for Emacs and
friends. My work also specifies how/when exactly to toggle back and
forth between these two modes.

What else do I need to further specify in the concept of "visual order"?

> A visible U+17D2 has no rôle in the Khmer writing system.  On
> computers, it is a warning that the input of a subscript consonant is
> only half done.  There are three units of the writing system in that
> word - KHMER LETTER PO, KHMER CONSONANT SIGN COENG RO*, and KHMER SIGN
> YUUKALEAPINTU.

> [and I could quote a whole lot more]

Richard, you are obviously magnitudes more savvy in shaping and stuff
than me, and I can't quickly pick up your knowledge to properly answer
to all the issues you mentioned.

What you probably still haven't realized is that I aimed to address a
much lower level issue than the ones you keep bringing up. Currently,
no matter what terminal emulator you pick, as soon as you start doing
BiDi (vim, emacs, cat, echo...), you end up with words being written
backwards. I mean, maybe they show up correctly with emacs, but they
show up incorrectly with vim and cat. Then you switch to a different
emulator, or toggle a setting, and suddenly vim and cat will be okay,
and emacs won't. This is bad.

This is the low level issue I'm trying to address, to make sure that
letters of words are always shown in the correct order. There's no way
you could do shaping underneath this level, it makes no sense to talk
about shaping, zero-width (non)joining, special Khmer symbols and
whatnot on reversed words, right? The order of the letters needs to be
fixed first, which is what I'm doing, and then all the bells and
whistles needed for shaping might come on top of this.

Right now I'm doing this BiDi work all voluntarily. As much as I'd
love to solve all the problems of the world, I don't have capacity for
that. As for shaping, chances are that I'm not going to get there,
unless someone offers a decent paid job :P. What I'm looking for right
now is feedback on whether the low-level BiDi work makes sense, and
whether it really creates proper grounds for building shaping etc. on
top of it one day.

Hope this clarifies a lot. And again, thanks for all your precious
input, but we've heavily diverged from the scope of my work.

cheers,
egmont

Re: Proposal for BiDi in terminal emulators

2019-02-02 Thread Richard Wordingham via Unicode

On Sat, 2 Feb 2019 12:54:16 +0100
Egmont Koblinger via Unicode  wrote:

> Hi Richard,
> 
> > > Are they okay to be present in visual order (the terminal's
> > > explicit mode, what we're discussing now) too?  
> >
> > Where do you define the order for explicit mode?  
> 
> In explicit mode, the application (Emacs, Vim, whatever) reorders the
> characters, and passes visual order (left to right) to the terminal
> emulator. The terminal emulator preserves this visual order, doesn't
> reshuffle anything.

Seriously, you need to give a definition of 'visual order' for this
context.  Not everyone shares your chiralist view.

> How to handle ZW(N)J in visual order? What's the desired way? Is it
> specified anywhere? As far as I know, they specify the relation
> between two adjacent characters of the logical order, which might not
> even become adjacent in the visual. Should they always "stick" to the
> preceding character, for example?

> The Unicode BiDi algorithm doesn't seem to make a difference between
> base letters and combining accents for reordering. So, given in an RTL
> text a base letter + a combining accent, the BiDi algorithm gives the
> visual LTR order of the combining accent first (on the left), followed
> by the base letter. This order is not okay for terminal emulators.
> Combining accents have to be reordered in the output of the Unicode
> BiDi algorithm, so that they come after the base letter even in the
> visual LTR order. This is e.g. what FriBidi does by default, due to
> the REORDER_NSM flag.

> Presumably it doesn't just reorder non-spacing combining accents, but
> also ZW(N)J and alike symbols too, which already smells pretty
> problematic, doesn't it? Or is this what you need there, too?

Even for logically ordered text, the positioning of the joiners is not
spelt out.  For example, I may have the sequence , and want to specify the ligating behavior of NA.  I would choose
, but this wouldn't let me choose
between it ligating with NA or with TA.

What happens when one selects text from the display?  I think this may
affect the choice of text representation for the cells.

For storing an explicit string in unnatural order free of bidi controls,
I would start with the equivalent implicit mode string, reverse it, and
pass that.  I believe the cell contents would then need to be reversed
again for rendering.  A good test case would be ; the ZWJ ligates the points, not base consonants.

> > There may be complications in ensuring that
> >  gets
> > stored as the content of a single cell.  
> 
> How should the terminal emulator know which cell (the previous or the
> subsequent) do these two s belong to?

I think this has to depend on convention.  One scheme that might work
is, storing the contents in logical order:

 =>  ZWJ and ZWJ 
ZWJ =>  ZWJ and ZWJ 
ZWNJ =>  and 
ZWJ ZWNJ =>  and ZWJ 
ZWNJ ZWJ =>  ZWJ and 

It may be better to have left and right connection bits in the cell
attributes instead of characters, and restore ZWJ and ZWNJ when the
text is cut and pasted from the terminal.  Note that storing
presentation forms in the terminal would, nowadays, normally cause cut
and paste to obtain an unfaithful copy of the original text. 
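
A minimal sketch of the "connection bits" idea, assuming the terminal keeps a small integer of per-cell attribute flags. The flag names and bit positions are hypothetical and not taken from any existing emulator; how ZWJ and ZWNJ would be restored on copy is deliberately left open.

```python
from enum import IntFlag

class CellFlags(IntFlag):
    # Ordinary attributes, shown only to make the point that the
    # join bits live in the same word as everything else.
    BOLD = 1
    UNDERLINE = 2
    # Hypothetical shaping bits: does the glyph connect to its neighbour?
    JOIN_LEFT = 4
    JOIN_RIGHT = 8

def joins(flags):
    """Return (joins_left, joins_right) for one cell's flags."""
    return bool(flags & CellFlags.JOIN_LEFT), bool(flags & CellFlags.JOIN_RIGHT)

medial = CellFlags.JOIN_LEFT | CellFlags.JOIN_RIGHT   # connects on both sides
isolated = CellFlags.BOLD                             # connects on neither side
```

Two bits per cell are enough to record the joining behaviour; the joiner characters themselves never need to occupy a cell.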

> > > Anyway, ZWJ/ZWNJ aren't sufficient to handle the cases I outlined
> > > above.  
> >
> > Example, please.  
> 
> Cropped strings, cropped strings that are adjacent to each other, and
> faulty shaping could kick in there.
> 
> Two fields on the UI. One in columns 36-40 with cyan background,
> aiming to show ABCDEF, but due to limited room, can only show ABCDE
> (let's say it's scrolled horizontally this way). Another in columns
> 41-45 with yellow background, aiming to show UVWXYZ, but due to
> limited space only VWXYZ is shown (it's scrolled horizontally like
> this).
> 
> What the terminal emulator sees is a continuous text of ABCDEVWXYZ.
> What the application wants to have is to get E shaped as if there was
> an F on its right, and get V shaped as if there was an U on its left.

Task:
So the text it's to show is parts of FEDCBA and ZYXWVU.  They are not
continuous with any other text in the terminal.  The display command
will not affect anything but columns 36 to 45.

Assumptions:
FEDCBA and ZYXWVU are each parts of right-to-left runs.

Solution:
The implicit mode text would be

ZYXWVEDCBA

(This assumes that Z, V, E and A could otherwise join with the contents
of other cells.)

So send left-to-right text:

ABCDEVWXYZ
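
The step from the implicit-mode string to the left-to-right string can be sketched as below. This assumes the whole text is a single right-to-left run with no bidi controls, so visual left-to-right order is simply the reversed string. The helper names and the dictionary standing in for a terminal row are made up for illustration.

```python
def explicit_from_implicit(implicit_rtl):
    # Assumption: one pure RTL run, so visual LTR order is the plain reverse.
    return implicit_rtl[::-1]

def place(row, first_col, visual_text):
    # Columns are 1-based, matching the example (columns 36 to 45).
    for offset, ch in enumerate(visual_text):
        row[first_col + offset] = ch

row = {}
place(row, 36, explicit_from_implicit("ZYXWVEDCBA"))
# row[36] is 'A', row[40] is 'E', row[41] is 'V', row[45] is 'Z'
```

Nothing in the ten cells records that E was once followed by F or that V was preceded by U, which is the information a shaper would need.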

> Once you address this problem, I'm not sure ZW(N)J are still
> required/desirable, rather than applying this more generic solution
> there as well.
> 
> > At present, VTE positions LTR Indic preceding spacing combining
> > marks after the consonant.  I thought your draft scheme corrected
> > this very local bidi issue, which is so local that the bidi
> > algorithm ignores it.  
> 
> Indic spacing combining marks are handled incorrectly by VTE and are
> being addressed in bug 584160 which I've already linked. This
> particular issue I

Re: Proposal for BiDi in terminal emulators

2019-02-02 Thread Doug Ewell via Unicode

Richard Wordingham wrote:

> Unicode may not deprecate the tag characters, but the characters of
> Plane 14 are widely deplored, despised or abhorred. That is why I
> think of it as the deprecated plane.

Think of it as the deplored plane, then, or the despised plane or the abhorred 
plane or the Plane That Shall Not Be Mentioned.

"Deprecated" is a term of art in Unicode.

--
Doug Ewell | Thornton, CO, US | ewellic.org

Use of tag characters in emoji sequences (was: Re: Proposal for BiDi in terminal emulators)

2019-02-02 Thread Doug Ewell via Unicode

Philippe Verdy wrote:

> Actually not all U+E0020 through U+E007E are "un-deprecated" for this
> use.

Characters in Unicode are not "deprecated" for some purposes and not for 
others. "Deprecated" is a clearly defined property in Unicode. The only 
reference that matters here is in PropList.txt:

E0000         ; Other_Default_Ignorable_Code_Point # Cn       <reserved-E0000>
E0001         ; Deprecated # Cf       LANGUAGE TAG
E0002..E001F  ; Other_Default_Ignorable_Code_Point # Cn  [30] <reserved-E0002>..<reserved-E001F>
E0020..E007F  ; Other_Grapheme_Extend # Cf  [96] TAG SPACE..CANCEL TAG
E0080..E00FF  ; Other_Default_Ignorable_Code_Point # Cn [128] <reserved-E0080>..<reserved-E00FF>

Note carefully that the code point marked "Deprecated" is deprecated, and the 
others listed here are not. (My earlier post saying that U+E007F was still 
deprecated was incorrect, as Andrew noted.)

> For now emoji flags only use:
> - U+E0041 through U+E005A (mapping to ASCII letters A through Z used
> in 2-letter ISO3166-1 codes). These are usable in pairs, without
> requiring any modifier (and only for ISO3166-1 registered codes).

Section C.1 of UTS #51 says otherwise:

tag_base   U+1F3F4 BLACK FLAG
tag_spec   (U+E0030 TAG DIGIT ZERO .. U+E0039 TAG DIGIT NINE,
            U+E0061 TAG LATIN SMALL LETTER A .. U+E007A TAG LATIN SMALL LETTER Z)+

Emoji flags use lowercase tag letters, not uppercase, and may also use digits. 
The digits are for CLDR subdivision IDs containing ISO 3166-2 code elements 
that happen to be numeric, and there are plenty of these. For example, "fr75" 
is the subdivision ID for Paris. Almost all ISO 3166-2 code elements in France 
are numeric.
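
As an illustration of the structure quoted above, here is a minimal Python sketch that builds a flag tag sequence from a subdivision ID. The function name is hypothetical. The sketch only checks the character repertoire (tag digits and lowercase tag letters); it does not check that the ID is a valid CLDR subdivision ID, and whether any font displays the result is a separate matter.

```python
BLACK_FLAG = "\U0001F3F4"   # tag_base
CANCEL_TAG = "\U000E007F"   # terminator
TAG_OFFSET = 0xE0000        # tag character = ASCII code point + 0xE0000
ALLOWED = "0123456789abcdefghijklmnopqrstuvwxyz"

def flag_tag_sequence(subdivision_id):
    # tag_spec permits only tag digits and lowercase tag letters.
    if not subdivision_id or any(c not in ALLOWED for c in subdivision_id):
        raise ValueError("expected lowercase ASCII letters and digits")
    tags = "".join(chr(TAG_OFFSET + ord(c)) for c in subdivision_id)
    return BLACK_FLAG + tags + CANCEL_TAG

scotland = flag_tag_sequence("gbsco")
paris = flag_tag_sequence("fr75")
```

Uppercase input such as "GBSCO" is rejected, reflecting the point that uppercase tag letters are not part of the syntax.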

> - I think that U+E0030 through U+E0039 (mapping to ASCII digits 0
> through 9) are reserved for ISO3166 extensions, started with only the
> 3 "countries" added in the United Kingdom ("ENENG", "ENSCO" and
> "ENWLS"), with possible pending additions for other ISO3166-2, but not
> mapping any dash separator).

There is no top-level country "EN", and if there were, I doubt Scotland and 
Wales would be enthusiastic to be considered part of it.

In any case, "gbeng" and "gbsco" and "gbwls" are merely the only subdivision 
IDs that are designated "RGI," or "recommended for general interchange," in 
CLDR. Any other subdivision ID can be used in a flag tag sequence, although the 
lack of RGI designation may cause vendors to think the sequence is "recommended 
against" and not support it in fonts.

As shown above, tag digits are not reserved for "ISO 3166 extensions" (possibly 
implying ISO 3166-1), but are already usable for ISO 3166-2 code elements.

> These tags are used as modifiers in sequences starting by a leading
> U+1F3F4
> 
> (WAVING BLACK FLAG) emoji.

This is true. (Note the lowercase tag letters.)

> - U+E007F (CANCEL TAG) is already used too for the regional extensions
> as a mandatory terminator, as seen in the three British countries.

This is true.

> It is not used for country flags made of 2-letter emoji codes without
> any leading flag emoji.

This is true, but not particularly relevant, as these use Regional Indicator 
Symbols and have nothing to do with the "emoji codes" discussed elsewhere.

> And the proposal discussed here to use U+E003C, mapped to the ASCII
> "<" LOWER THAN

LESS-THAN SIGN

> as a leading tag sequence for reencoding HTML tags in sequences
> terminated by U+E003E ">" (and containing HTML element names using
> lowercase letter tags,

Only "b", "i", "u", and "s" by definition.

> possibly digit tags in these names,

No.

> and "/" for HTML tags terminator, possibly also U+E0020 SPACE TAG for
> separating HTML attributes, U+003D "=" for attribute values, U+E0022
> (') or U+E0027 (") around attribute values, but a problem if the
> mapped element names or attributes contain non-ASCII characters...)

None of these are part of Andrew's mechanism. It's just b, i, u, and s.

> is not standard

Neither Andrew nor anyone else claimed it was.

> (it's just an experiment in one font),

It applies to any TrueType font, because the rendering engine can apply these 
four styles (in any combination) to any TrueType font.

> and would in fact not be compatible with the existing specification
> for tags.

Good thing nobody claimed they were.

> So only U+E0020 through U+E0040, and U+E005B through U+E007E remain
> deprecated.

Da capo.

--
Doug Ewell | Thornton, CO, US | ewellic.org

Re: Proposal for BiDi in terminal emulators

2019-02-02 Thread Benjamin Riefenstahl via Unicode

Hi Egmont, hi all,


This is an interesting discussion.  If only because I would have
thought that there is only minimal interest by the actual target
audience in supporting these scripts in a terminal, given the severe
limitations of that environment.  The most important limitation seems to
me that a monospaced font must be used, which does not suit most
scripts that do shaping.  On the script level I am familiar with Arabic,
Syriac and Mandaic (I don't actually speak any of these languages, so if
you want a real expert, I am not that person).  Monospaced Arabic
struggles and is not very elegant.  I have not seen solutions for
monospaced Syriac or Mandaic, and I have trouble even imagining them.

OTOH, that inelegance maybe can be an excuse (or a guide if you prefer)
to make the implementation simpler in other respects, because
expectations should be lower than for a graphical application.

Anyway, as a concrete addition to the discussion, I have a simple Arabic
shaping solution for Emacs on the terminal, especially on the Linux
console, and this discussion finally made me make it public on Gitlab,
see https://gitlab.com/cc_benny/termshape.  The Gitlab CD is activated,
so (mostly) ready-made Emacs packages can be downloaded as build
artifacts.  If anybody wants to discuss this implementation, we should
probably move that discussion somewhere else, like to the Emacs mailing
list (https://lists.gnu.org/mailman/listinfo/emacs-devel).

Some specific technical points from thinking about the problem on my
side:

Presentation forms: Termshape uses the Arabic presentation forms
available and so it is somewhat limited as mentioned by Eli.  Given that
we need to keep the implementation simple anyway, I am not sure that
significantly more is really needed, at least given what Emacs provides
already.  Additional character forms could be added, where the Unicode
repertoire is not sufficient.  This could use PUA characters or other
means like terminal control sequences.  In both cases a common
understanding would be needed between the terminal (or the font used by
it) and the application, outside of Unicode.

Ligatures: With most shaping one character is transformed into a
character form that still only occupies one cell.  A ligature like
lam-alif OTOH only occupies one cell for two characters, so for
justification etc. the application will have to know that the two
characters together have a width of 1 on the screen.  This is easier if
the application does the selection of ligatures.  If you want to do this
in the terminal, the application would probably need to have some way to
measure the display width of a string, so that it can handle the
situation.  Be prepared though for the application to make quite a lot
of these requests.  For my own main use case for Emacs on a terminal,
display over SSH, that could become a problem.

Diacritics: The application can know what is a non-spacing character and
what is not.  So it can know that diacritics do not occupy their own
cell and it should be able to ignore whether the terminal supports a
specific diacritic or not.  If the terminal does not support a diacritic
the terminal can either just leave it out or the terminal can mess up
the display more or less irreparably.  In the first case, the worst is
that the user does not see the character, in the second case the
application cannot do anything about it with reasonable effort IMO.
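
The ligature and diacritic points above can be sketched together as an application-side width calculation. This is not Termshape's code: it assumes every base character takes exactly one cell, that only the four lam-alef ligatures are formed, and that the terminal follows the same rules. The function name is hypothetical.

```python
import unicodedata

LAM = "\u0644"
# Alef with madda, with hamza above, with hamza below, and plain alef:
# the four letters that form a lam-alef ligature.
ALEFS = {"\u0622", "\u0623", "\u0625", "\u0627"}

def cell_width(text):
    """Count terminal cells for text given in logical order."""
    width = 0
    prev_base = None
    for ch in text:
        if unicodedata.category(ch) in ("Mn", "Me"):
            continue              # non-spacing mark: shares the base's cell
        if prev_base == LAM and ch in ALEFS:
            prev_base = None      # ligature complete: alef adds no cell
            continue
        width += 1
        prev_base = ch
    return width
```

A fatha between lam and alef does not stop the ligature in this sketch, and the result gives no hint as to which of the two letters the mark belongs to, which is the ambiguity described in the next paragraph.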

A real problem is a combination of diacritics and ligatures.  Any
diacritic applies to only one character in the ligature, and between the
application and the terminal it is currently not possible to determine
which one.  This is one area where an implementation in the terminal
would clearly have the advantage.  But a terminal control sequence could
also help.  IMO we are talking about a luxury problem here, though.  Do
we want to set as our first goal showing complete quranic verses in all
their glory, or are we satisfied with everyday Arabic like say the
website of a modern Arabic newspaper?


Thanks for your effort and for starting this discussion,
benny

Re: Proposal for BiDi in terminal emulators

2019-02-02 Thread Richard Wordingham via Unicode

On Sat, 2 Feb 2019 13:18:03 +0100
Egmont Koblinger via Unicode  wrote:

> Hi Richard,
> 
> On Sat, Feb 2, 2019 at 12:43 PM Richard Wordingham via Unicode
>  wrote:
> 
> > I'm not conversant with the details of terminal controls and I
> > haven't used fields.  However, where I spoke of lines above, I
> > believe you can simply translate it to fields.  I don't know how
> > one best handles fields - are they a list, possibly of rows within
> > fields, or are they stored as cell attributes?  
> 
> The essential point is that the terminal emulator stores "cells".
> Pretty much all the data (with very few exceptions) resides in cells.
> 
> A cell contains a base letter, followed by possibly a few non-spacing
> marks. A cell has a foreground color, background color, bold,
> underlined, italic etc. properties.
> 
> How these cells are linked up, in an array or whatever, is mostly
> irrelevant since it's likely to be different in every implementation.
> 
> Of course it is possible to extend the per-cell storage to contain a
> "previous" and a "next" character, as to be used for shaping purposes
> only. Some questions: Is this enough (e.g. aren't there cases where
> more than the immediate neighbor are relevant)? Is the next base
> character enough, or do we also need to know the combining accents
> that belong to that? And can't we store significantly less information
> than the actual letter (let's say, 1 out of 13 [randomly made up
> number] possible ways of shaping)?

Truncation at the start of the string gives us the clearest nasty.  If
you look at TUS Figure 13-7, you'll find that the final U+182D in
ᠵᠠᠷᠯᠢᠭ  _jarlig_
'order' and  ᠴᠢᠷᠢᠭ_chirig_
'soldier' should be different because the former word has a masculine
vowel, namely U+1820, and the latter doesn't. When written horizontally, the
Mongolian script is left-to-right, i.e. upside down compared to its
Aramaic ancestor.  What we need to note is the preceding
'gender'-determining vowel.

There are analogues of THAI CHARACTER SARA AM in the Tai Tham script -
 and
.  In all the examples of
the latter I've seen, U+1A74 is placed over the preceding consonant, so
if U+1A64 is lost through lack of space, the U+1A74 should still
remain.  The former is a matter of style.  Outside Thailand, the mark
above is clearly associated (with one exception) with the U+1A74, so
both can safely vanish together.  In Thailand, the U+1A74 can be
associated with the consonant instead, or hover over the gap between
consonant and vowel.

The exception is the ligature .
That should really only get one cell.  The combination ᨶ᩶ᩣᩴ  'water, fluid' looks like
.

There are then some interesting Indic phenomena depending on how one
treats subscript consonants.  The coding structure  is widespread.

As a lesser form of this, in Khmer  the first consonant and
U+17B6 ligate, and the ligation is highly visible on that
consonant even if the vowel is covered up.  If the display were to
chop off the second consonant, all that need be remembered is the
following vowel.

There is also the repha and analogues.  Repha is graphically a
superscript mark, but is usually encoded as .  Burmese
kinzi is similar, but has a 3-character code.  They really ought to be
associated with the same cell as the immediately following consonant. 

The good news is that the record of the relevant neighbour can be
compressed to a few bits.

Richard.

Re: Proposal for BiDi in terminal emulators

2019-02-02 Thread Richard Wordingham via Unicode

On Sat, 02 Feb 2019 14:01:46 +0100
Kent Karlsson via Unicode  wrote:

> On 2019-02-02 12:17, "Egmont Koblinger" wrote:

> > Most terminal emulators handle non-spacing combining marks, it's a
> > piece of cake. (Spacing marks are more problematic.)  

> Well, I guess you may need to put some (practical) limit to the number
> of non-spacing marks (like max two above + max one below; overstrikes
> are an edge case). Otherwise one may need to either increase the line
> height (bad idea for a terminal emulator I think) or the marks start
> to visually interfere with text on other lines (even with the hinted
> limits there may be some interference), also a bad idea for a terminal
> emulator. So I'm not so sure that non-spacing marks is a piece of
> cake... (I.e., need to limit them.)

Doesn't Jerusalem in biblical Hebrew sometimes have 3 marks below the
lamedh?  The depth then is the maximum depth, not the sum of the
depths. 

Tai Lue has 'mai sat 3 lem' - that's three marks above for a
combination common enough to have a name.  Throw in the repetition mark
and that's four marks above if you treat the subscript consonant as a
mark (or code it to comply with the USE's erroneous grammar).

Richard.

Re: Proposal for BiDi in terminal emulators

2019-02-02 Thread Philippe Verdy via Unicode

Actually not all  U+E0020 through U+E007E are "un-deprecated" for this use.

For now emoji flags only use:
- U+E0041 through U+E005A (mapping to ASCII letters A through Z used in
2-letter ISO3166-1 codes). These are usable in pairs, without requiring any
modifier (and only for ISO3166-1 registered codes).
- I think that U+E0030 through U+E0039 (mapping to ASCII digits 0 through 9)
are reserved for ISO3166 extensions, started with only the 3 "countries"
added in the United Kingdom ("ENENG", "ENSCO" and "ENWLS"), with possible
pending additions for other ISO3166-2, but not mapping any dash separator).
These tags are used as modifiers in sequences starting by a leading U+1F3F4

(WAVING
BLACK FLAG) emoji.
- U+E007F (CANCEL TAG) is already used too for the regional extensions as a
mandatory terminator, as seen in the three British countries. It is not
used for country flags made of 2-letter emoji codes without any leading
flag emoji.

And the proposal discussed here to use U+E003C, mapped to the ASCII "<"
LOWER THAN as a leading tag sequence for reencoding HTML tags in sequences
terminated by U+E003E ">" (and containing HTML element names using
lowercase letter tags, possibly digit tags in these names, and "/" for HTML
tags terminator, possibly also U+E0020 SPACE TAG for separating HTML
attributes, U+003D "=" for attribute values, U+E0022 (') or U+E0027 (")
around attribute values, but a problem if the mapped element names or
attributes contain non-ASCII characters...) is not standard (it's just an
experiment in one font), and would in fact not be compatible with the
existing specification for tags.

So only U+E0020 through U+E0040, and U+E005B through U+E007E remain
deprecated.

On Fri, 1 Feb 2019 at 23:26, Doug Ewell via Unicode
wrote:

> Richard Wordingham wrote:
>
> > Language tagging is already available in Unicode, via the tag
> > characters in the deprecated plane.
>
> Plane 14 isn't deprecated -- that isn't a property of planes -- and the
> tag characters U+E0020 through U+E007E have been un-deprecated for use
> with emoji flags. Only U+E0001 LANGUAGE TAG and U+E007F CANCEL TAG are
> deprecated.
>
> --
> Doug Ewell | Thornton, CO, US | ewellic.org
>
>
>

Re: Proposal for BiDi in terminal emulators

2019-02-02 Thread Kent Karlsson via Unicode



On 2019-02-02 12:17, "Egmont Koblinger" wrote:

> the font. It's taken from EastAsianWidth (or other means, which we're
> working on: https://gitlab.freedesktop.org/terminal-wg/specifications/issues/9

Yes, that too:
FE0F ? VARIATION SELECTOR-16 = emoji variation selector

But the issue you refer to only deals with U+FE0F. There is also U+FE0E:
FE0E ? VARIATION SELECTOR-15 = text variation selector
which can make a character that is "default emoji" (which are wide)
into "text variant", often single-width, for instance:
1F315 FE0E ; text style;  # (6.0) FULL MOON SYMBOL
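
A rough sketch of how a width calculation might take both selectors into account. It assumes that U+FE0E always forces single width and U+FE0F always forces double width, which is a simplification ("often single-width" above) and not something the linked issue has settled. The function names are hypothetical.

```python
import unicodedata

VS15 = "\uFE0E"   # text variation selector
VS16 = "\uFE0F"   # emoji variation selector

def base_width(ch):
    # Wide and Fullwidth characters take two cells, everything else one.
    return 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1

def width_with_selector(base, selector=""):
    # Assumption: the selector, when present, overrides the base width.
    if selector == VS15:
        return 1
    if selector == VS16:
        return 2
    return base_width(base)

moon_as_text = width_with_selector("\U0001F315", VS15)
```

The base width comes from the Unicode data bundled with the Python interpreter, so the application and the terminal can still disagree if they use different Unicode versions.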

---

>> Likewise non-spacing combining characters should
>> be possible to deal reasonably with.
> 
> Most terminal emulators handle non-spacing combining marks, it's a
> piece of cake. (Spacing marks are more problematic.)

Well, I guess you may need to put some (practical) limit to the number
of non-spacing marks (like max two above + max one below; overstrikes
are an edge case). Otherwise one may need to either increase the line
height (bad idea for a terminal emulator I think) or the marks start
to visually interfere with text on other lines (even with the hinted
limits there may be some interference), also a bad idea for a terminal
emulator. So I'm not so sure that non-spacing marks is a piece of cake...
(I.e., need to limit them.)

/Kent K

Re: Proposal for BiDi in terminal emulators

2019-02-02 Thread Egmont Koblinger via Unicode

Hi Richard,

On Sat, Feb 2, 2019 at 12:43 PM Richard Wordingham via Unicode
 wrote:

> I'm not conversant with the details of terminal controls and I haven't
> used fields.  However, where I spoke of lines above, I believe you can
> simply translate it to fields.  I don't know how one best handles
> fields - are they a list, possibly of rows within fields, or are they
> stored as cell attributes?

The essential point is that the terminal emulator stores "cells".
Pretty much all the data (with very few exceptions) resides in cells.

A cell contains a base letter, followed by possibly a few non-spacing
marks. A cell has a foreground color, background color, bold,
underlined, italic etc. properties.

How these cells are linked up, in an array or whatever, is mostly
irrelevant since it's likely to be different in every implementation.

Of course it is possible to extend the per-cell storage to contain a
"previous" and a "next" character, as to be used for shaping purposes
only. Some questions: Is this enough (e.g. aren't there cases where
more than the immediate neighbor are relevant)? Is the next base
character enough, or do we also need to know the combining accents
that belong to that? And can't we store significantly less information
than the actual letter (let's say, 1 out of 13 [randomly made up
number] possible ways of shaping)?

Terminal emulators potentially store a lot of data (some even support
infinite scrolling), and try to handle it in some efficient way.
That is, they do all sorts of bitpacking and crazy stuff. E.g. some
might reject adding new attributes when the per-cell size of the
attributes would exceed 4 or 8 bytes, both for memory and performance
reasons. Another example: VTE has one global pool of all the base
character + combining accents combos that it has encountered, and
starts assigning single codepoints to them from U+1000 or so, so
that then for each cell the base letter + combining accents still
don't require more storage than 4 bytes.

The takeaway is: the less data we need to remember per cell, the
better, and every bit matters.
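
The pooling trick can be sketched roughly as follows. This is not VTE's code: the class name is made up, and starting the substitute values at 0x110000 (just past the Unicode range, so they cannot collide with real code points) is an assumption chosen for the sketch.

```python
class ClusterPool:
    # Assumption: substitute values start just past the last code point.
    FIRST_ID = 0x110000

    def __init__(self):
        self._ids = {}        # cluster string -> substitute value
        self._clusters = []   # substitute value - FIRST_ID -> cluster string

    def encode(self, cluster):
        """Return one integer for a base letter plus its combining marks."""
        if len(cluster) == 1:
            return ord(cluster)           # a lone character stores itself
        if cluster not in self._ids:
            self._ids[cluster] = self.FIRST_ID + len(self._clusters)
            self._clusters.append(cluster)
        return self._ids[cluster]

    def decode(self, value):
        if value < self.FIRST_ID:
            return chr(value)
        return self._clusters[value - self.FIRST_ID]

pool = ClusterPool()
first = pool.encode("e\u0301")    # e + COMBINING ACUTE ACCENT
again = pool.encode("e\u0301")    # same combo, same value
plain = pool.encode("x")
```

Each cell then holds a single fixed-size integer regardless of how many marks the cluster carries, at the cost of one global table.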

But to recap, we're now just peeking into a possible future extension
of the specs to see if it's viable (I guess it is), which I believe
emulators might reasonably decide not to implement, if they think
performance is more important than proper shaping in all the special
cases.

cheers,
egmont

Re: Proposal for BiDi in terminal emulators

2019-02-02 Thread Egmont Koblinger via Unicode

Hi Richard,

> Not all terminal emulators can deal with non-spacing combining
> characters.

Both Hebrew and Arabic seem to use non-spacing combining characters,
presumably other Arabic-like scripts too.

I forgot to state explicitly in my docs, but let's just say that
handling non-spacing combining accents is a prerequisite for BiDi
support. Those emulators that don't handle them should be out of scope
for our current discussion.


cheers,
egmont

Re: Proposal for BiDi in terminal emulators

2019-02-02 Thread Egmont Koblinger via Unicode

Hi Richard,

> > Are they okay to be present in visual order (the terminal's explicit
> > mode, what we're discussing now) too?
>
> Where do you define the order for explicit mode?

In explicit mode, the application (Emacs, Vim, whatever) reorders the
characters, and passes visual order (left to right) to the terminal
emulator. The terminal emulator preserves this visual order, doesn't
reshuffle anything.

How to handle ZW(N)J in visual order? What's the desired way? Is it
specified anywhere? As far as I know, they specify the relation
between two adjacent characters of the logical order, which might not
even become adjacent in the visual. Should they always "stick" to the
preceding character, for example?

The Unicode BiDi algorithm doesn't seem to make a difference between
base letters and combining accents for reordering. So, given in an RTL
text a base letter + a combining accent, the BiDi algorithm gives the
visual LTR order of the combining accent first (on the left), followed
by the base letter. This order is not okay for terminal emulators.
Combining accents have to be reordered in the output of the Unicode
BiDi algorithm, so that they come after the base letter even in the
visual LTR order. This is e.g. what FriBidi does by default, due to
the REORDER_NSM flag.
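
To make the difference concrete, here is a minimal Python sketch. It is not FriBidi's implementation: it assumes a single, purely right-to-left run (so the visual LTR order is just the reversed string) and treats general categories Mn and Me as non-spacing marks. The function names are made up for illustration.

```python
import unicodedata

def is_nsm(ch):
    # Non-spacing and enclosing marks attach to the preceding base letter.
    return unicodedata.category(ch) in ("Mn", "Me")

def visual_ltr_naive(rtl_run):
    # Plain reversal: each mark lands to the left of its base letter.
    return rtl_run[::-1]

def visual_ltr_marks_after_base(rtl_run):
    # Group every base letter with the marks that follow it in logical
    # order, then reverse the groups while keeping each group intact.
    clusters = []
    for ch in rtl_run:
        if is_nsm(ch) and clusters:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return "".join(reversed(clusters))

logical = "\u05E9\u05B8\u05DC"                # SHIN, QAMATS (a mark), LAMED
naive = visual_ltr_naive(logical)             # LAMED, QAMATS, SHIN
fixed = visual_ltr_marks_after_base(logical)  # LAMED, SHIN, QAMATS
```

ZWJ and ZWNJ have general category Cf, so this sketch treats them as ordinary characters; whether they should travel with a neighbour is exactly the open question below.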

Presumably it doesn't just reorder non-spacing combining accents, but
also ZW(N)J and alike symbols too, which already smells pretty
problematic, doesn't it? Or is this what you need there, too?

> There may be complications in ensuring that
>  gets stored
> as the content of a single cell.

How should the terminal emulator know which cell (the previous or the
subsequent) do these two s belong to?

> > Anyway, ZWJ/ZWNJ aren't sufficient to handle the cases I outlined
> > above.
>
> Example, please.

Cropped strings, cropped strings that are adjacent to each other, and
faulty shaping could kick in there.

Two fields on the UI. One in columns 36-40 with cyan background,
aiming to show ABCDEF, but due to limited room, can only show ABCDE
(let's say it's scrolled horizontally this way). Another in columns
41-45 with yellow background, aiming to show UVWXYZ, but due to
limited space only VWXYZ is shown (it's scrolled horizontally like
this).

What the terminal emulator sees is a continuous text of ABCDEVWXYZ.
What the application wants to have is to get E shaped as if there was
an F on its right, and get V shaped as if there was an U on its left.

Once you address this problem, I'm not sure ZW(N)J are still
required/desirable, rather than applying this more generic solution
there as well.

> At present, VTE positions LTR Indic preceding spacing combining marks
> after the consonant.  I thought your draft scheme corrected this very
> local bidi issue, which is so local that the bidi algorithm ignores it.

Indic spacing combining marks are handled incorrectly by VTE and are
being addressed in bug 584160 which I've already linked. This
particular issue I don't consider BiDi at all. It's something totally
different. The spacing accent can be to the right, somewhat on top of
and somewhat to the right, on top of, somewhat to the left and
somewhat on top of, or fully on the left. It's not binary left or
right. Proper rendering should be done by font, and not at all by the
BiDi of the terminal. The terminal is unaware of how much the base
glyph is shifted to the right and the accent to its left. All that the
terminal needs to do (and VTE gets it wrong now) is to pass these two
into whichever font rendering engine in one single step.

> So ព្រះ  LETTER RO, U+17C8 KHMER SIGN > _preah_ 'prefix denoting
> respect for gods, kings, etc.' will be three cells <្រ,ព,ៈ> = <(COENG,
> RA), PO, YUUKALEAPINTU> and cause no confusion?  Or will the cells be
> ?

First it's a base character followed by a non-spacing mark. As in most
terminal emulators (and now we're absolutely not talking about my BiDi
proposal) they are stored in the same cell. The first cell contains
(PO, COENG).

The next two are a base character followed by a spacing mark. In VTE
584160 I outline two possible approaches, but the one I'm in favor of,
is that the row's second cell contains RO and the third cell contains
YUUKALEAPINTU, which two are combined together properly when the
logical contents get displayed. Another possibility which I'm
pondering about is whether the emulation layer should combine them,
that is, have the second cell store the "first half of (RO, YUUKA)"
and the third cell store the "second half of (RO, YUUKA)".

Does this make any sense? If not, could you please explain what the
desired behavior is, and why? Please keep in mind that I know nothing
about Khmer in particular.

Anyway, here we're talking about something that's totally independent
from my BiDi work. It's also something that should be standardized
across terminals, sure, but maybe not right now :)


cheers,
egmont

Re: Proposal for BiDi in terminal emulators

2019-02-02 Thread Richard Wordingham via Unicode

On Fri, 1 Feb 2019 15:15:53 +0100
Egmont Koblinger via Unicode  wrote:

> Hi Richard,
> 
> On Fri, Feb 1, 2019 at 12:19 AM Richard Wordingham via Unicode
>  wrote:
> 
> > Cropped why?  If the problem is the truncation of lines, one can
> > simply store the next character.
> 
> Yup, truncation of lines for example.
> 
> I agree that one could "store the next character". We could extend the
> terminal emulation protocol where by some means you can specify that
> column 80 contains a letter X, and even though there's no column 81,
> an app can still tell the terminal emulator that it should imagine
> that column 81 contains the letter Y, and perform shaping accordingly.
> 
> This will need to be done not just at the end of the terminal, but at
> any position, and for both directions. Think of e.g. a vertically
> split tmux. You should be able to tell that column 40 contains X which
> should be shaped as if column 41 contained Y, and column 41 contains Z
> which should be shaped as if column 40 contained A.
> 
> What I cannot see at all is how this could be "simply". Could you
> please elaborate on that? I don't find this simple at all!

I'm not conversant with the details of terminal controls and I haven't
used fields.  However, where I spoke of lines above, I believe you can
simply translate it to fields.  I don't know how one best handles
fields - are they a list, possibly of rows within fields, or are they
stored as cell attributes?

If one were doing it by cell attributes, and the example above were in
row 6, one might store some of the information below if 'Y' and 'A' do
not appear in the display.

Row 6 column 40: This is end of LTR paragraph, and treat as followed by Y

Row 6 column 41: This is end of RTL paragraph, and treat as followed by A

If storing attributes of rows within fields, the above information would
be stored for the row within the field.

If lines are wrapped, then you would probably want to store that fact
instead and access the character contents indirectly.
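A minimal sketch of the cell-attribute variant, in Python. The names
Cell, paragraph_direction, shaping_next and shaping_context are
hypothetical, made up for illustration; they are not taken from any
terminal emulator or from the draft specification.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Cell:
    """One character cell of the grid (hypothetical layout)."""
    text: str                          # what is stored in the cell
    paragraph_direction: str = "LTR"   # "LTR" or "RTL"
    # Character that is not displayed, but that the shaper should
    # pretend follows this cell in logical order. None means no hint.
    shaping_next: Optional[str] = None


def shaping_context(cell: Cell) -> str:
    """Return the string a shaper would be handed for this cell."""
    return cell.text + (cell.shaping_next or "")


# Row 6 of a vertically split screen, columns 40 and 41.
row6 = {
    40: Cell("X", "LTR", shaping_next="Y"),
    41: Cell("Z", "RTL", shaping_next="A"),
}
```

Only the hidden neighbour is recorded, so the cell contents and the
cursor addressing stay exactly as they are without the hint.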

Richard.

Re: Proposal for BiDi in terminal emulators

2019-02-02 Thread Egmont Koblinger via Unicode

Hi Kent,

On Sat, Feb 2, 2019 at 12:41 AM Kent Karlsson via Unicode
 wrote:

> [...] neither of which
> should directly consult the font [...]
> But terminals
> (read terminal emulators) can deal with mixed single width and double
> width characters (which is, IIUC, the motivation for the datafile
> EastAsianWidth.txt).

Yup, exactly; and for this reason, no terminal I'm aware of takes the
single vs. double width property from the font. The logical behavior,
i.e. knowing which logical cell contains what character (or which half
of what character, in case of double wide ones) isn't influenced by
the font. It's taken from EastAsianWidth (or other means, which we're
working on: https://gitlab.freedesktop.org/terminal-wg/specifications/issues/9
, to address e.g. incompatibilities arising from different Unicode
version used by the app vs. the terminal, as you pointed out).

Also think of cases like when the user modifies the font of the
terminal run-time, or a headless terminal emulator, or a screen/tmux
attached to multiple terminal emulators of different fonts at once...
Adjusting the logical behavior according to the font would definitely
be a wrong path to take.
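
To illustrate, here is a rough Python sketch of a font-independent
width lookup based on the East Asian Width and General Category
properties. The function name cell_width is made up, and real
terminals apply more rules than this (control characters, ambiguous
width, emoji); treat it as an assumption-laden sketch only.

```python
import unicodedata


def cell_width(ch: str) -> int:
    """Cells occupied by one code point, decided without any font.

    Simplified: control characters and ambiguous-width characters
    are not treated specially here.
    """
    if unicodedata.category(ch) in ("Mn", "Me"):
        return 0  # non-spacing / enclosing mark: joins the previous cell
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2  # Wide or Fullwidth: two cells
    return 1
```

The result depends only on the code point and the Unicode data
version, never on the font or on the neighboring characters.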

> Likewise non-spacing combining characters should
> be possible to deal reasonably with.

Most terminal emulators handle non-spacing combining marks; it's a
piece of cake. (Spacing marks are more problematic.)

> All sorts of problems arise; feeding
> the emulator a character (or "short" strings) at a time not allowed
> to buffer for display (causing reshaping or movement of already
> displayed characters, edit position movement even within a single
> line, etc.).

Emulators need to update their screen to reflect whatever is in the
logical buffer, and the contents of the logical buffer mustn't depend
on the timing of the incoming data. As a consequence, when the input
stream contains a base character + a combining accent, there is a slim
chance that the base character without the combining accent makes it
to the display for a short time. It's the emulator's job to "fix" it
(that is, redraw the glyph with the combining accent) once the accent
is received. If an emulator doesn't do it correctly, it's simply a bug
in that emulator.
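
As a toy illustration in Python (the Row class and its feed method
are hypothetical, not VTE code): a combining mark is attached to the
previous cell whenever it arrives, so the buffer ends up the same no
matter how the input was chunked.

```python
import unicodedata


class Row:
    """A toy logical buffer: one string per cell."""

    def __init__(self):
        self.cells = []

    def feed(self, data: str) -> None:
        for ch in data:
            if unicodedata.combining(ch) and self.cells:
                # Nonzero combining class: attach to the previous cell.
                self.cells[-1] += ch
            else:
                self.cells.append(ch)


# Same bytes, delivered in one write or in three separate writes.
one_go = Row()
one_go.feed("e\u0301x")

split = Row()
for chunk in ("e", "\u0301", "x"):
    split.feed(chunk)
```

Both rows hold the same two cells; only the moment of the redraw
differs.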

On a side note, we're also working on an extension for atomic updates
at https://gitlab.freedesktop.org/terminal-wg/specifications/issues/9
which should significantly further decrease the chance of such
intermittent screen updates.

cheers,
egmont

Re: Proposal for BiDi in terminal emulators

2019-02-01 Thread Richard Wordingham via Unicode

On Fri, 1 Feb 2019 15:15:53 +0100
Egmont Koblinger via Unicode  wrote:

> Hi Richard,
> 
> On Fri, Feb 1, 2019 at 12:19 AM Richard Wordingham via Unicode
>  wrote:
> 
> > Cropped why?  If the problem is the truncation of lines, one can
> > simply store the next character.
> 
> Yup, truncation of lines for example.
> 
> I agree that one could "store the next character". We could extend the
> terminal emulation protocol where by some means you can specify that
> column 80 contains a letter X, and even though there's no column 81,
> an app can still tell the terminal emulator that it should imagine
> that column 81 contains the letter Y, and perform shaping accordingly.
> 
> This will need to be done not just at the end of the terminal, but at
> any position, and for both directions. Think of e.g. a vertically
> split tmux. You should be able to tell that column 40 contains X which
> should be shaped as if column 41 contained Y, and column 41 contains Z
> which should be shaped as if column 40 contained A.
> 
> What I cannot see at all is how this could be done "simply". Could you
> please elaborate on that? I don't find this simple at all!
> 
> > > It's not able to
> > > separate different UI elements that happen to be adjacent in the
> > > terminal, separated by different background color or such.  
> >
> > ZWJ and ZWNJ can handle that.  
> 
> Wouldn't it be a semantical misuse of these characters, though?

No.  ZWNJ is used before the inanimate plural suffix of Persian, and in
at least one language,  is used to distinguish one usage from
the digit ٥ (or is it the digit ۵?).

> They are supposed to be present in the logical order, and in logical
> order (that is: the terminal's implicit mode) they can work as
> desired.
> 
> Are they okay to be present in visual order (the terminal's explicit
> mode, what we're discussing now) too?

Where do you define the order for explicit mode?

There may be complications in ensuring that
 gets stored
as the content of a single cell.

> 
> Anyway, ZWJ/ZWNJ aren't sufficient to handle the cases I outlined
> above.

Example, please.
> 
> > If a general text manipulating application, e.g. cat, grep or awk,
> > is writing to a file, it should not convert normal Arabic
> > characters to presentation forms.  You are now asking a general
> > application to determine whether it is writing to a terminal or
> > not, and alter its output if it is writing to a terminal.  
> 
> No, this is absolutely not what I'm talking about!
> 
> There are two vastly different modes of the terminal. For "cat",
> "grep" etc. the terminal will be in implicit mode. Absolutely no BiDi
> handling is expected from these apps, the terminal will do BiDi and
> shaping (perhaps using Harfbuzz; perhaps using presentation form
> characters as a temporarily low hanging fruit until a better one is
> implemented – the choice is obviously up to the implementation and not
> to the specification).
> 
> For "emacs" and friends, an explicit mode is required where visual
> order is passed to the terminal. What we're discussing is how to
> handle shaping in this mode.

(Partitioning grapheme clusters and Indic syllables)
> > But it as an issue that needs to be addressed.  As a terminal can be
> > addressed by cell, an application may need to keep track of what
> > text went into each cell. Misery results when the application gets
> > it wrong.  
> 
> My recommendation doesn't change this principle at all. In the lower
> (emulation) layer every character still goes into the cell it used to
> go to, and is addressable using cursor motion escapes and so on
> exactly as without BiDi.

At present, VTE positions LTR Indic preceding spacing combining marks
after the consonant.  I thought your draft scheme corrected this very
local bidi issue, which is so local that the bidi algorithm ignores it.
 
> 
> 
> > How many cells do CJK ideographs occupy?  We've had a strong hint
> > that a medial BEH should occupy one cell, while an isolated BEH
> > should occupy two.  
> 
> CJK occupy two, but they do regardless of what's around them. That is,
> they already occupy two cells in the logical buffers, in the emulation
> layer.
> 
> There is absolutely no sane way we can make in terminal emulation a
> character's logical width (as in number of cells it occupies) depend
> on its neighboring characters. (And even if we could by some terrible
> hacks, it would break the principle you just said as "misery
> results...", and the principle Eli said that things should remain
> reasonably simple, otherwise hardly anyone will bother implementing
> them.) This is a compromise Arabic folks will have to accept.

So ព្រះ  _preah_ 'prefix denoting
respect for gods, kings, etc.' will be three cells <្រ,ព,ៈ> = <(COENG,
RA), PO, YUUKALEAPINTU> and cause no confusion?  Or will the cells be
?

Richard.

Re: Proposal for BiDi in terminal emulators

2019-02-01 Thread Richard Wordingham via Unicode

On Sat, 02 Feb 2019 00:38:04 +0100
Kent Karlsson via Unicode  wrote:

> Den 2019-02-01 19:57, skrev "Richard Wordingham via Unicode"
> :

> "Monospaced font" is really a concept with modification. Even for
> "plain old ASCII" there are two advance widths, not just one: 0 for
> control characters (and escape/control sequences, neither of which
> should directly consult the font; even such things as OSC sequences,
> but the latter are a bad idea to have in any line one might wish to
> edit (vi/emacs/...) via a terminal emulator window). But terminals
> (read terminal emulators) can deal with mixed single width and double
> width characters (which is, IIUC, the motivation for the datafile
> EastAsianWidth.txt). Likewise non-spacing combining characters should
> be possible to deal reasonably with.

I remember Michael Everson getting scant sympathy here when he
complained that his 'monospaced' font was rejected as such because
combining characters had zero width.  The rule his font fell foul of
invites distinct NFC and NFD forms of the same string to be rendered
differently; it does not observe the spirit of canonical equivalence.

> It is a lot more difficult to deal with BiDi in a terminal emulator,
> also shaping may be hard to do, as well as reordering (or even
> splitting) combining characters. All sorts of problems arise;...

Which is why Egmont is here looking for comments and advice.

Not all terminal emulators can deal with non-spacing combining
characters. I have recently been having unpleasant experiences with what
appears to be Wikimedia's CodeEditor; it expects even non-spacing Thai
vowel marks to have an advance width of one cell.  The text is rendered
in GUI style, i.e. according to the font selected somehow, but the
cursor is positioned according to the character count.  I haven't yet
investigated its treatment of control characters.  I think I'm going to
have to make a font that works to its assumptions.
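
A small Python sketch of the mismatch, using a hypothetical helper
named columns: counting characters gives one position per code point,
while counting cells skips the non-spacing marks. Wide characters are
deliberately ignored here to keep the example short.

```python
import unicodedata


def columns(text: str) -> int:
    """Count cells, giving non-spacing and enclosing marks zero width."""
    return sum(
        0 if unicodedata.category(ch) in ("Mn", "Me") else 1
        for ch in text
    )


# THAI CHARACTER KO KAI followed by THAI CHARACTER SARA I (a vowel mark)
word = "\u0e01\u0e34"
char_count = len(word)      # what a character-counting cursor uses
cell_count = columns(word)  # what the rendered text occupies
```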

Richard.

Re: Proposal for BiDi in terminal emulators

2019-02-01 Thread Richard Wordingham via Unicode

On Fri, 01 Feb 2019 15:18:13 -0700
Doug Ewell via Unicode  wrote:

> Richard Wordingham wrote:
>  
> > Language tagging is already available in Unicode, via the tag
> > characters in the deprecated plane.  
>  
> Plane 14 isn't deprecated -- that isn't a property of planes -- and
> the tag characters U+E0020 through U+E007E have been un-deprecated
> for use with emoji flags. Only U+E0001 LANGUAGE TAG and U+E007F
> CANCEL TAG are deprecated.

Unicode may not deprecate the tag characters, but the characters of
Plane 14 are widely deplored, despised or abhorred.  That is why I think
of it as the deprecated plane.

Richard.

Re: Proposal for BiDi in terminal emulators

2019-02-01 Thread Khaled Hosny via Unicode

On Fri, Feb 01, 2019 at 06:57:43PM +, Richard Wordingham via Unicode wrote:
> On Fri, 1 Feb 2019 13:02:45 +0200
> Khaled Hosny via Unicode  wrote:
> 
> > On Thu, Jan 31, 2019 at 11:17:19PM +, Richard Wordingham via
> > Unicode wrote:
> > > On Thu, 31 Jan 2019 12:46:48 +0100
> > > Egmont Koblinger  wrote:
> > > 
> > > No.  How many cells do CJK ideographs occupy?  We've had a strong
> > > hint that a medial BEH should occupy one cell, while an isolated
> > > BEH should occupy two.  
> > 
> > Monospaced Arabic fonts (there are not that many of them) are designed
> > so that all forms occupy just one cell (most even including the
> > mandatory lam-alef ligatures), unlike CJK fonts.
> > 
> > > I can imagine the terminal restricting itself to monospaced fonts,
> > > disable the “liga” feature just in case, and expect the font to
> > > behave well. Any other magic is likely to fail.
> 
> Of course, strictly speaking, a monospaced font cannot support harakat
> as Egmont has proposed.

There are two approaches for handling them in monospaced fonts;
combining them with base characters as usual, or as spacing characters
placed next to their bases. The latter approach is a bit unusual, but
makes editing heavily voweled text a bit more pleasant. It requires good
OpenType support, though, so virtually no terminal supports it.

Regards,
Khaled

Re: Proposal for BiDi in terminal emulators

2019-02-01 Thread Kent Karlsson via Unicode

Den 2019-02-01 19:57, skrev "Richard Wordingham via Unicode"
:

> On Fri, 1 Feb 2019 13:02:45 +0200
> Khaled Hosny via Unicode  wrote:
> 
>> On Thu, Jan 31, 2019 at 11:17:19PM +, Richard Wordingham via
>> Unicode wrote:
>>> On Thu, 31 Jan 2019 12:46:48 +0100
>>> Egmont Koblinger  wrote:
>>> 
>>> No.  How many cells do CJK ideographs occupy?  We've had a strong
>>> hint that a medial BEH should occupy one cell, while an isolated
>>> BEH should occupy two.
>> 
>> Monospaced Arabic fonts (there are not that many of them) are designed
>> so that all forms occupy just one cell (most even including the
>> mandatory lam-alef ligatures), unlike CJK fonts.
>> 
>> I can imagine the terminal restricting itself to monospaced fonts,
>> disable the “liga” feature just in case, and expect the font to
>> behave well. Any other magic is likely to fail.
> 
> Of course, strictly speaking, a monospaced font cannot support harakat
> as Egmont has proposed.
> 
> Richard.

(harakat: non-spacing vowel mark in Arabic)

"Monospaced font" is really a concept with modification. Even for
"plain old ASCII" there are two advance widths, not just one: 0 for
control characters (and escape/control sequences, neither of which
should directly consult the font; even such things as OSC sequences,
but the latter are a bad idea to have in any line one might wish to
edit (vi/emacs/...) via a terminal emulator window). But terminals
(read terminal emulators) can deal with mixed single width and double
width characters (which is, IIUC, the motivation for the datafile
EastAsianWidth.txt). Likewise non-spacing combining characters should
be possible to deal reasonably with.

It is a lot more difficult to deal with BiDi in a terminal emulator,
also shaping may be hard to do, as well as reordering (or even
splitting) combining characters. All sorts of problems arise; feeding
the emulator a character (or "short" strings) at a time not allowed
to buffer for display (causing reshaping or movement of already
displayed characters, edit position movement even within a single
line, etc.). Even if solvable for a "GUI" text editor (not via a
terminal), they do not seem to be workable in a terminal (emulator)
setting. Esp. not if one also wants to support multiline editing
(vi/emacs/...) or even single-line editing.

As long as editing is limited to a single line (such as the system
line editor, or an "enhanced functionality" line editor (such as
that used for bash; moving in the history sets the edit position
at EOL)) even variable width ("proportional") fonts should not pose
a major problem. But for multiline editors (à la vi/emacs) it would
not be possible to synch nicely (unless one accepts strange jumps)
the visual edit position and the actual edit position in the edit
buffer: The program would not have access to the advance width data
from the font that the terminal emulator uses, unless one
revolutionises what terminal emulators do... (And I don't see a
case for doing that.) But both a terminal emulator and multiline
editing programs (for terminal emulators) still can have access
to EastAsianWidth data as well as which characters are non-spacing;
those are not font dependent. (There might be some glitches if
the Unicode versions used do not match (the terminal emulator
and the program being run are most often on different systems),
but only for characters where these properties have changed,
e.g. newly allocated non-spacing marks.)
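
A rough Python sketch of that version caveat. The helper name
known_to_this_side is made up; unicodedata.unidata_version is the
Unicode version of the data Python itself ships, which need not match
the terminal emulator's.

```python
import unicodedata

# Unicode version of the character database on this side of the link.
local_version = unicodedata.unidata_version


def known_to_this_side(ch: str) -> bool:
    """False if the local Unicode data has no assignment for ch.

    For such a character this side can only guess whether it is
    non-spacing or wide, so it may disagree with the other side.
    """
    return unicodedata.category(ch) != "Cn"
```

A program could fall back to a conservative width for characters its
own data does not know, but it still cannot tell which version the
terminal uses.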

/Kent K

PS
No, I have not done extensive testing of various terminal emulators
on how well they handle the stuff above.

Re: Proposal for BiDi in terminal emulators

2019-02-01 Thread Ken Whistler via Unicode


Richard,

On 2/1/2019 1:30 PM, Richard Wordingham via Unicode wrote:


Language tagging is already available in Unicode, via the tag characters
in the deprecated plane.


Recte:

1. Plane 14 is not a "deprecated plane".

2. The tag characters in Tag Character block (U+E0000..U+E007F) are not 
deprecated. (They are used, for example, by UTS #51 to specify emoji tag 
sequences.)


3. However, the use of U+E0001 LANGUAGE TAG and the mechanism of using 
tag characters for spelling out language tags are explicitly deprecated 
by the standard. See: "Deprecated Use for Language Tagging" in Section 
23.9 Tag Characters.


https://www.unicode.org/versions/Unicode11.0.0/ch23.pdf#G30427

and PropList.txt:

E0001 ; Deprecated # Cf   LANGUAGE TAG

As I stated earlier: language tags should use BCP 47, and belong in the 
markup level, not in the plain text stream.


--Ken

Re: Proposal for BiDi in terminal emulators

2019-02-01 Thread Andrew West via Unicode

On Fri, 1 Feb 2019 at 22:20, Doug Ewell via Unicode  wrote:
>
> Richard Wordingham wrote:
>
> > Language tagging is already available in Unicode, via the tag
> > characters in the deprecated plane.
>
> Plane 14 isn't deprecated -- that isn't a property of planes -- and the
> tag characters U+E0020 through U+E007E have been un-deprecated for use
> with emoji flags. Only U+E0001 LANGUAGE TAG and U+E007F CANCEL TAG are
> deprecated.

Cancel Tag is not deprecated any longer either
(http://www.unicode.org/Public/UNIDATA/PropList.txt).

Andrew

Re: Proposal for BiDi in terminal emulators

2019-02-01 Thread Doug Ewell via Unicode

Richard Wordingham wrote:

> Language tagging is already available in Unicode, via the tag
> characters in the deprecated plane.

Plane 14 isn't deprecated -- that isn't a property of planes -- and the
tag characters U+E0020 through U+E007E have been un-deprecated for use
with emoji flags. Only U+E0001 LANGUAGE TAG and U+E007F CANCEL TAG are
deprecated.

--
Doug Ewell | Thornton, CO, US | ewellic.org

Re: Proposal for BiDi in terminal emulators

2019-02-01 Thread Richard Wordingham via Unicode

On Fri, 1 Feb 2019 14:47:22 +0100
Egmont Koblinger via Unicode  wrote:

> Hi Ken,
> 
> > [language tag]
> > That is a complete non-starter for the Unicode Standard.  
> 
> Thanks for your input!
> 
> (I hope it was clear that I just started throwing in random ideas, as
> in a brainstorming session. This one is ruled out, then.)

Language tagging is already available in Unicode, via the tag characters
in the deprecated plane.

Richard.

Re: Proposal for BiDi in terminal emulators

2019-02-01 Thread Richard Wordingham via Unicode

On Fri, 1 Feb 2019 13:02:45 +0200
Khaled Hosny via Unicode  wrote:

> On Thu, Jan 31, 2019 at 11:17:19PM +, Richard Wordingham via
> Unicode wrote:
> > On Thu, 31 Jan 2019 12:46:48 +0100
> > Egmont Koblinger  wrote:
> > 
> > No.  How many cells do CJK ideographs occupy?  We've had a strong
> > hint that a medial BEH should occupy one cell, while an isolated
> > BEH should occupy two.  
> 
> Monospaced Arabic fonts (there are not that many of them) are designed
> so that all forms occupy just one cell (most even including the
> mandatory lam-alef ligatures), unlike CJK fonts.
> 
> I can imagine the terminal restricting itself to monospaced fonts,
> disable the “liga” feature just in case, and expect the font to
> behave well. Any other magic is likely to fail.

Of course, strictly speaking, a monospaced font cannot support harakat
as Egmont has proposed.

Richard.

Re: Proposal for BiDi in terminal emulators

2019-02-01 Thread Egmont Koblinger via Unicode

Hi,

I'm trying to respond to every question, but I'm having a hard time
keeping up :-)

Thanks a lot for all the precious input about shaping!

Here's my suggestion, for version 0.2 of the recommendation:

- No longer encourage any use of presentation form characters.

- State that it's the terminal emulator's task to perform shaping,
both in implicit and explicit modes.

- Leave it for a future enhancement to handle trickier cases in
explicit mode, such as shaping of a word that's only partially
visible, or prevent shaping when two words happen to touch each other
and are visually separated by other means (e.g. background color).
Leave it for further research whether we could use ZWJ/ZWNJ here,
whether we could use ECMA's SAPV 5-8 & 21-11, or whether we should
invent something new (perhaps even telling the terminal emulator what
neighboring previous/next characters to imagine there for the purpose
of shaping)...

Let me know if you have any remaining problems/concerns/etc.

As for the implementation in VTE: initially I'll still use
presentation form characters, solely because that's a low hanging
fruit approach (low investment, high gain). I've already implemented
it in about an hour (a bit of further hacks will be necessary to
extend it to explicit mode, but still easily doable), whereas
switching to HarfBuzz is expected to take weeks of heavy work. We'll
tackle that in a subsequent version. And if anyone's happy to help,
there's already some bounty for harfbuzz support :)

Thanks again for the great guidance!

cheers,
egmont

On Tue, Jan 29, 2019 at 1:50 PM Egmont Koblinger  wrote:
>
> Hi,
>
> Terminal emulators are a powerful tool used by many people for various
> tasks. Most terminal emulators' bugtracker has a request to add RTL /
> BiDi support. Unicode has supported BiDi for about 20 years now.
> Still, the intersection of these two fields isn't solved. Even some
> Unicode experts have stated over time that no one knows how to do it
> properly.
>
> The only documentation I could find (ECMA TR/53) predates the Unicode
> BiDi algorithm, and as such no surprise that it doesn't follow the
> current state of the art or best practices.
>
> Some terminal emulators decided to run the BiDi algorithm for display
> purposes on their lines (rather than paragraphs, uh), not seeing the big
> picture that such a behavior turns them into a platform on top of
> which it's literally impossible to implement proper BiDi-aware text
> editing (vim, emacs, whatever) experience. In turn, vim, emacs and
> friends stand there clueless, not knowing how to do BiDi in terminals.
>
> With about 5 years of experience in terminal emulator development, and
> some prior BiDi homepage developing experience with the kind mentoring
> of one of the BiDi gurus (Aharon, if you're reading this, hi there!),
> I decided to tackle this issue. I studied and evaluated the
> aforementioned documentation and the behavior of such terminals,
> pointed out the problems, and came up with a draft proposal.
>
> My work isn't complete yet. One of the most important pending issues
> is to figure out how to track BiDi control characters (e.g. which
> character cells they belong to), it is to be addressed in a subsequent
> version. But I sincerely hope I managed to get the basics right and
> clean enough so that work can begin on implementing proper support in
> terminal emulators as well as fullscreen text applications; and as we
> gain experience and feedback, extending the spec to address the
> missing bits too.
>
> You can find this (draft) specification at [1]. Feedback is welcome –
> if it's an actionable one then preferably over there in the project's
> bugtracker.
>
> [1] https://terminal-wg.pages.freedesktop.org/bidi/
>
>
> cheers,
> egmont (GNOME Terminal / VTE co-developer)

Re: Proposal for BiDi in terminal emulators

2019-02-01 Thread Eli Zaretskii via Unicode

> From: Egmont Koblinger 
> Date: Fri, 1 Feb 2019 14:35:35 +0100
> Cc: Frédéric Grosshans , 
>   unicode@unicode.org
> 
> > You could do that, but it will require a lot of non-trivial processing
> > from the applications.  Text-mode applications don't want any complex
> > tinkering, they want just to write their text and be done.  The more
> > overhead you add to that simple task, the less probable it is that
> > applications will support such a terminal.
> 
> I agree with your overall observation, but I'm not sure how much it
> applies to this context.
> 
> Text-mode applications have to run the BiDi algorithm. The one I
> picked can also do shaping (well, the pretty limited one, using
> presentation forms).

Reordering and shaping have different requirements.  Reordering can be
done based only on the codepoints, whereas shaping needs also intimate
knowledge of the fonts being used.  The former can be done by a
text-mode application, the latter cannot, not anywhere close to what
readers of the respective scripts would expect.
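
To make the first half concrete, a short Python sketch: the Bidi_Class
values that the reordering algorithm starts from can be looked up from
the code points alone, with no font involved. The helper name
bidi_classes is made up for this example, and it shows only the input
to the algorithm, not the reordering itself.

```python
import unicodedata


def bidi_classes(text: str) -> list:
    """Bidi_Class of each code point: the UBA's per-character input."""
    return [unicodedata.bidirectional(ch) for ch in text]


# Latin letter, Hebrew ALEF, space, European digit
sample = "a\u05d0 1"
classes = bidi_classes(sample)  # ['L', 'R', 'WS', 'EN']
```

Nothing comparable exists for shaping: which glyph a letter becomes is
decided by the font's own tables.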

Re: Proposal for BiDi in terminal emulators

2019-02-01 Thread Egmont Koblinger via Unicode

Hi Richard,

On Fri, Feb 1, 2019 at 12:19 AM Richard Wordingham via Unicode
 wrote:

> Cropped why?  If the problem is the truncation of lines, one can simply
> store the next character.

Yup, truncation of lines for example.

I agree that one could "store the next character". We could extend the
terminal emulation protocol where by some means you can specify that
column 80 contains a letter X, and even though there's no column 81,
an app can still tell the terminal emulator that it should imagine
that column 81 contains the letter Y, and perform shaping accordingly.

This will need to be done not just at the end of the terminal, but at
any position, and for both directions. Think of e.g. a vertically
split tmux. You should be able to tell that column 40 contains X which
should be shaped as if column 41 contained Y, and column 41 contains Z
which should be shaped as if column 40 contained A.

What I cannot see at all is how this could be done "simply". Could you
please elaborate on that? I don't find this simple at all!

> > It's not able to
> > separate different UI elements that happen to be adjacent in the
> > terminal, separated by different background color or such.
>
> ZWJ and ZWNJ can handle that.

Wouldn't it be a semantical misuse of these characters, though?

They are supposed to be present in the logical order, and in logical
order (that is: the terminal's implicit mode) they can work as
desired.

Are they okay to be present in visual order (the terminal's explicit
mode, what we're discussing now) too?

Anyway, ZWJ/ZWNJ aren't sufficient to handle the cases I outlined above.

> If a general text manipulating application, e.g. cat, grep or awk, is
> writing to a file, it should not convert normal Arabic characters to
> presentation forms.  You are now asking a general application to
> determine whether it is writing to a terminal or not, and alter its
> output if it is writing to a terminal.

No, this is absolutely not what I'm talking about!

There are two vastly different modes of the terminal. For "cat",
"grep" etc. the terminal will be in implicit mode. Absolutely no BiDi
handling is expected from these apps, the terminal will do BiDi and
shaping (perhaps using Harfbuzz; perhaps using presentation form
characters as a temporarily low hanging fruit until a better one is
implemented – the choice is obviously up to the implementation and not
to the specification).

For "emacs" and friends, an explicit mode is required where visual
order is passed to the terminal. What we're discussing is how to
handle shaping in this mode.

> But it as an issue that needs to be addressed.  As a terminal can be
> addressed by cell, an application may need to keep track of what text
> went into each cell. Misery results when the application gets it wrong.

My recommendation doesn't change this principle at all. In the lower
(emulation) layer every character still goes into the cell it used to
go to, and is addressable using cursor motion escapes and so on
exactly as without BiDi.

> How many cells do CJK ideographs occupy?  We've had a strong hint
> that a medial BEH should occupy one cell, while an isolated BEH should
> occupy two.

CJK occupy two, but they do regardless of what's around them. That is,
they already occupy two cells in the logical buffers, in the emulation
layer.

There is absolutely no sane way we can make in terminal emulation a
character's logical width (as in number of cells it occupies) depend
on its neighboring characters. (And even if we could by some terrible
hacks, it would break the principle you just said as "misery
results...", and the principle Eli said that things should remain
reasonably simple, otherwise hardly anyone will bother implementing
them.) This is a compromise Arabic folks will have to accept.

When displayed, it's up to terminal emulators to perhaps widen/shrink
cells as they want to (they might even totally give up on monospace
fonts), but then they'll risk vertical lines not lining up perfectly,
content overflowing on the right etc. Konsole does such things.

cheers,
egmont

Re: Proposal for BiDi in terminal emulators

2019-02-01 Thread Egmont Koblinger via Unicode

Hi Ken,

> [language tag]
> That is a complete non-starter for the Unicode Standard.

Thanks for your input!

(I hope it was clear that I just started throwing in random ideas, as
in a brainstorming session. This one is ruled out, then.)

cheers,
egmont

Re: Proposal for BiDi in terminal emulators

2019-02-01 Thread Egmont Koblinger via Unicode

On Thu, Jan 31, 2019 at 4:26 PM Eli Zaretskii  wrote:

> > Yes, I do argue that emacs will need to print a new escape sequence.
> > Which is much-much-much-much-much better than having to tell users to
> > go into the settings of their macOS Terminal / Konsole /
> > gnome-terminal etc. and disable BiDi there, isn't it?
>
> I'm not sure I agree.  Most users can disable bidi reordering of the
> terminal once and for all.  They don't need it.

What users are we talking about? Those who don't need BiDi ever at all?

Everything is already perfect for them! They shouldn't care about the
"enable BiDi" settings of their terminal, either value will result in
the same, correct behavior for them.

Or do we talk about users who care about BiDi inside Emacs, but don't
care about BiDi when echo'ing, cat'ing...? Do such users exist? Well,
even if they do, they're not the only target of my work.

Remember: My proposal aims to address both the Emacs as well as the
echo/cat/... use cases. These are substantially different use cases
that require the terminal emulator to be in a different mode, and thus
automatic switching between the two modes has to be solved.

cheers,
egmont

Re: Proposal for BiDi in terminal emulators

2019-02-01 Thread Eli Zaretskii via Unicode

> From: Egmont Koblinger 
> Date: Fri, 1 Feb 2019 14:16:03 +0100
> Cc: Adam Borowski , unicode@unicode.org
> 
> There's absolutely no way we could reorder first, and then handle
> TAB's cursor movement. TAB's cursor movement happens in the lower
> layer, reordering happens in the upper one.

But that means you won't ever be able to be in compliance with UAX#9,
because TAB has distinct properties that affect the UBA.  If you
reorder after all TABs have been converted to spaces, you will not be
able to implement the support for Segment Separator characters.  Am I
missing something?
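
The property in question can be checked with a few lines of Python
(variable names are just for illustration): TAB has Bidi_Class S,
Segment Separator, while the spaces it gets expanded to are WS, so the
separator information is gone once the expansion has happened.

```python
import unicodedata

tab_class = unicodedata.bidirectional("\t")    # 'S'  (Segment Separator)
space_class = unicodedata.bidirectional(" ")   # 'WS' (Whitespace)

# After tab expansion there is no character of class S left to act on.
expanded = "a\tb".expandtabs(8)
classes_after = [unicodedata.bidirectional(ch) for ch in expanded]
```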

Re: Proposal for BiDi in terminal emulators

2019-02-01 Thread Egmont Koblinger via Unicode

Hi,

On Thu, Jan 31, 2019 at 4:14 PM Eli Zaretskii wrote:

> I suggest that you show the result to someone who does read Arabic.

I contacted one guy who is pretty knowledgeable in Arabic scripts as
well as terminal emulation, and sent him an early unpublished version
of the proposal, but unfortunately he was busy and didn't have the
chance to respond. Let this thread be one where we invite Arabic folks
to comment :)

> Small changes can be very unpleasant to the eyes of an Arabic reader.

I can easily imagine that!

I can assure you, seeing õ instead of ő in my native language is
extremely unpleasant to my eyes. Depending on the font you're using,
you may not even have spotted any difference.

But could someone argue for example that seeing an "i" and "w" equally
wide is unpleasant to their eyes? Where do we draw the lines of what's
an acceptable compromise on a platform that has technical limitations
(fixed grid) to begin with? We really need input from Arabic folks to
answer this.

I'm also wondering: how unpleasant it is if a letter is cut in half
(e.g. overflows at the edge of the text editor), and is shaped not
according to the entire word but according to the visible part? I took
it from the CSS specification that the desired behavior is to shape it
according to the entire word, but I honestly don't know how acceptable
or how unpleasant the other approach is.

> You could do that, but it will require a lot of non-trivial processing
> from the applications.  Text-mode applications don't want any complex
> tinkering, they want just to write their text and be done.  The more
> overhead you add to that simple task, the less probable it is that
> applications will support such a terminal.

I agree with your overall observation, but I'm not sure how much it
applies to this context.

Text-mode applications have to run the BiDi algorithm. The one I
picked can also do shaping (well, the pretty limited one, using
presentation forms). Shouldn't any BiDi algorithm also provide methods
for shaping that produce some output that can be easily sent to the
terminals? Shouldn't we push for them?

As far as I imagine the ideal solution, doing this part of shaping
shouldn't be any harder for apps than doing BiDi, basically all they
would need to do is hook up to existing API methods.

Of course, given the current APIs, it's probably really not this simple.

cheers,
egmont

Re: Proposal for BiDi in terminal emulators

2019-02-01 Thread Eli Zaretskii via Unicode

> From: Egmont Koblinger 
> Date: Fri, 1 Feb 2019 13:54:02 +0100
> Cc: Adam Borowski , unicode@unicode.org
> 
> For this behavior, the only feature you need from a terminal emulator
> is to have a mode where it doesn't shuffle the characters. Currently
> every emulator I'm aware of has such a mode, although in some of them
> you have to tweak the settings to get to this mode (in my firm opinion
> it's an unacceptable user experience), while in emulators according to
> my specification there'll be an escape sequence for text-mode apps to
> automatically switch to this mode.

Like I said, as long as not every emulator supports this control, an
application will need to detect its support, and that in itself is a
complication.

> > This is indeed a significant issue, because it means applications
> > cannot force the terminal use a certain non-default base paragraph
> > direction.
> 
> They can, since there's a dedicated escape sequence (SCP) for setting
> the base paragraph.

Does this change the base direction globally for the whole screen, or
only for the current text?  The latter is what's needed.

And again, just detecting whether this is supported is a
complication.  Emitting LRM or RLM as needed is much easier.
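As a rough illustration of the LRM/RLM approach, the sketch below prepends
a mark only when the first strong character would otherwise give the wrong
base direction. It assumes the terminal auto-detects paragraph direction
from the first strong character (UAX#9 rules P2/P3) and it ignores
isolates; the function names are hypothetical.

```python
import unicodedata

LRM = "\u200e"  # LEFT-TO-RIGHT MARK, bidi class L
RLM = "\u200f"  # RIGHT-TO-LEFT MARK, bidi class R

def first_strong_direction(text):
    """Simplified P2/P3: direction of the first strong character."""
    for ch in text:
        cls = unicodedata.bidirectional(ch)
        if cls == "L":
            return "ltr"
        if cls in ("R", "AL"):
            return "rtl"
    return "ltr"  # no strong character: default to left-to-right

def force_direction(text, direction):
    """Prepend LRM or RLM if auto-detection would pick the other direction."""
    if first_strong_direction(text) == direction:
        return text
    return (RLM if direction == "rtl" else LRM) + text

# A line starting with Hebrew that should still be laid out left to right.
forced = force_direction("\u05e9\u05dc\u05d5\u05dd abc", "ltr")
```

No capability query is needed for this: a terminal that does no bidi at
all simply receives one extra zero-width character.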

Re: Proposal for BiDi in terminal emulators

2019-02-01 Thread Eli Zaretskii via Unicode

> From: Egmont Koblinger 
> Date: Fri, 1 Feb 2019 13:40:48 +0100
> Cc: unicode@unicode.org
> 
> I now understand that presentation forms isn't an ideal possible
> approach, and the recommendation should be improved here.
> 
> Until it happens, I'm uncertain whether using presentation form
> characters is a decent low hanging fruit that significantly improves
> the readability in some situations (e.g. "good enough" in some sense
> for Arabic), or is a dead end we shouldn't propagate.

IMNSHO, you shouldn't try solving this problem on your own.  Instead,
use a shaping engine, such as HarfBuzz, to do that for you, since the
emulator does know which fonts it uses, and can access their
properties.  The only problem a terminal emulator does need to solve
in this regard is what to do when N codepoints yield M /= N glyphs
that the shaping engine tells you to emit, or, more generally, when
the width on display after shaping is different from N times the
character cell width.

> I still do not agree however that the entire responsibility can be
> shifted to the emulator. There are certain important bits of
> information that are only available to the application, and not the
> emulator – as with many other aspects, such as reordering,
> copy-pasting, searching in the data in BiDi-aware text editors using
> the terminal's explicit mode, which are all pushed to the application
> because the emulator cannot do them correctly.

As soon as you attempt to target applications that move cursor and use
cursor addressing, you are in trouble, and should IMO refrain from
trying to support such applications.  For example, Emacs doesn't even
write whole lines to the screen, it compares the internal
representation of what's on the screen and what should be there, and
only emits the parts that should be modified.  (It does that to
minimize screen writes, which might be expensive, especially if
writing to a remote terminal.)  In such cases, the emulator doesn't
stand a chance of doing TRT, because the application doesn't provide
enough context for it to reorder text correctly.

So I don't think a bidi-aware terminal emulator can support any
application more complex than those which write full lines to the
terminal, like 'cat', 'sed', 'diff', 'grep', etc.
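The kind of partial update described above can be sketched as follows.
This is not how Emacs is implemented; it is a minimal, hypothetical
illustration of "emit only what changed", using the standard CUP sequence
(CSI row ; column H) for cursor addressing and the convention that
uppercase stands for RTL letters.

```python
def minimal_update(old_row, new_row, row_number):
    """Return what an application might send to turn old_row into new_row."""
    width = max(len(old_row), len(new_row))
    old = old_row.ljust(width)
    new = new_row.ljust(width)
    if old == new:
        return ""
    # Skip the common prefix and the common suffix.
    start = 0
    while old[start] == new[start]:
        start += 1
    end = width
    while old[end - 1] == new[end - 1]:
        end -= 1
    # CUP uses 1-based row and column numbers.
    return f"\x1b[{row_number};{start + 1}H{new[start:end]}"

update = minimal_update("foo BAR baz", "foo BAz baz", 3)
print(repr(update))
```

Here the terminal receives a cursor jump and the single letter "z", with
none of the surrounding text that a reordering algorithm would need.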

> I believe we should further study the situation, e.g. see whether
> ECMA-48's SAPV (8.3.18) parameters 5..8 (to explicitly specify whether
> to use isolated/initial/medial/final form for each character) are
> flexible enough to convey all this information, or perhaps a new, more
> powerful means should be crafted.

Once again, I think it's impractical to expect applications to emit
these controls.  The emulator must do this part of the job.

Re: Proposal for BiDi in terminal emulators

2019-02-01 Thread Egmont Koblinger via Unicode

Hi,

On Thu, Jan 31, 2019 at 4:10 PM Eli Zaretskii  wrote:

> The reordering happens before TABs are converted to cursor motion,
> does it not?

No, not at all.

You cannot "mix" handling the input and reordering, since the input is
not available as a single step but arrives continuously in a stream.

Consider a heavy BiDi text such as (I'm making up some random
gibberish, uppercase being RTL):
foo BAR FSI BAz quUX 1234 PDI whatEVer

Someone prints it to the terminal, but due to the internals, the
terminal doesn't receive this in one single step but in two
consecutive ones, broken in the middle. Maybe the app split it in half
(e.g. a shell script printed fragments one by one using printf without
a trailing newline). Maybe the emitter is a "dd" printing blocks of
let's say 4kB and this line happens to cross a boundary. Maybe a
transport layer such as ssh split it for whatever reason.

Then would you take the first half of this text, let's say
foo BAR FSI BAz quU
even with unbalanced BiDi controls, then reorder it, and continue from
it? Continue how? How to remember to reorder the second half too, but
not the first half once again in order to avoid "double BiDi"?
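The problem can be shown with a deliberately crude stand-in for the BiDi
algorithm. The toy_reorder function below is an assumption made for this
example only: it reverses runs of uppercase words (uppercase meaning RTL,
as above) and knows nothing about FSI, PDI or digits.

```python
import re

def toy_reorder(text):
    """Toy reordering: reverse each run of uppercase (pretend-RTL) words."""
    return re.sub(r"[A-Z]+(?: +[A-Z]+)*", lambda m: m.group(0)[::-1], text)

line = "foo BAR BAZ qux"

# Reordering the complete line once.
whole = toy_reorder(line)

# Reordering two chunks as they arrive, split in the middle of a word.
first, second = line[:9], line[9:]
chunked = toy_reorder(first) + toy_reorder(second)

print(whole)
print(chunked)
```

Even with this simplistic reordering, the chunked result differs from the
result for the whole line, so the outcome would depend on where the
transport happened to split the data.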

What to do with explicit cursor movement, would they jump to the
visual position? This would break absolutely basic principles, e.g.
jumping twice to the same location to overwrite a letter twice in a
row may actually end up overwriting two different letters, since
everything was potentially rearranged after the first overwrite
happened? Any application having any existing preconception about
cursor movement would uncontrollably fall apart.

This approach is doomed to fail big time (and was the reason I had to
drop ECMA TR/53's DCSM "presentation" mode).

The only reasonable way is if you have two layers. The bottom layer
does the emulation almost exactly as it used to do, with no BiDi
whatsoever (except for tiny additions, e.g. it tracks BiDi-related
properties such as the paragraph direction). The upper layer displays
the data, and this upper layer performs BiDi solely for display
purposes: using the lower layer's data as input, but not modifying it.

This is, by the way, also what current emulators that shuffle the
characters around do.

Let's also mention that the lower layer (emulation) should be as fast
as possible. e.g. VTE can handle input in the ballpark of 10MB/s.
Reordering, that is, running BiDi for display purposes needs to happen
much more rarely, maybe 20-60 times per second. It would be a
performance killer having to run the BiDi algorithm upon every
received chunk of data – in fact, to eliminate any possible behavior
difference due to timing difference, it'd need to happen after every
printable character received.
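A minimal sketch of this two-layer split, under heavy simplifying
assumptions: one row only, no wrapping, TAB stops every 8 columns, and a
toy reordering (uppercase words reversed) instead of the real BiDi
algorithm. The class and method names are invented for this illustration
and are not VTE's.

```python
import re

def toy_reorder(text):
    """Toy reordering: reverse each run of uppercase (pretend-RTL) words."""
    return re.sub(r"[A-Z]+(?: +[A-Z]+)*", lambda m: m.group(0)[::-1], text)

class EmulationLayer:
    """Lower layer: stores cells in logical order and never reorders."""
    def __init__(self, columns):
        self.cells = [" "] * columns
        self.cursor = 0
        self.dirty = False

    def feed(self, data):
        for ch in data:
            if ch == "\t":
                # TAB only moves the cursor; existing cells are not erased.
                self.cursor = min((self.cursor // 8 + 1) * 8, len(self.cells))
            elif self.cursor < len(self.cells):
                self.cells[self.cursor] = ch
                self.cursor += 1
            self.dirty = True

class DisplayLayer:
    """Upper layer: reorders a copy of the row, at most once per frame."""
    def __init__(self, emulation):
        self.emulation = emulation
        self.visual = "".join(emulation.cells)
        self.reorder_count = 0

    def render_frame(self):
        if self.emulation.dirty:
            self.visual = toy_reorder("".join(self.emulation.cells))
            self.reorder_count += 1
            self.emulation.dirty = False
        return self.visual

emu = EmulationLayer(16)
disp = DisplayLayer(emu)
emu.feed("foo BA")   # the input arrives in two chunks...
emu.feed("R x")
frame = disp.render_frame()   # ...but reordering runs once, on the whole row
```

However the input is chunked, the stored cells are the same, and the
reordering cost is tied to the frame rate rather than to the input rate.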

There's absolutely no way we could reorder first, and then handle
TAB's cursor movement. TAB's cursor movement happens in the lower
layer, reordering happens in the upper one.

cheers,
egmont

Re: Proposal for BiDi in terminal emulators

2019-02-01 Thread Egmont Koblinger via Unicode

Hi Eli,

> So we will some day have one such terminal emulator.  That's good, but
> a text-mode application that needs to support bidi cannot rely on its
> users all having access to that single terminal.

No. A text-mode application that needs to support BiDi must do the
BiDi itself and pass visual order to the emulator, and beforehand
switch the emulator to explicit mode so that you don't end up with
"double BiDi". Once you emit visual order, there's no need for any
BiDi control characters.

For this behavior, the only feature you need from a terminal emulator
is to have a mode where it doesn't shuffle the characters. Currently
every emulator I'm aware of has such a mode, although in some of them
you have to tweak the settings to get to this mode (in my firm opinion
it's an unacceptable user experience), while in emulators according to
my specification there'll be an escape sequence for text-mode apps to
automatically switch to this mode.

What BiDi control characters (LRE, LRI, FSI etc.) in implicit mode
will give you – if supported – is that you'll be able to execute "cat
file", and it'll be displayed correctly, even taking FSI and friends
as present in the file into account. Of course this will only work in
terminal emulators that support this.

> This is indeed a significant issue, because it means applications
> cannot force the terminal use a certain non-default base paragraph
> direction.

They can, since there's a dedicated escape sequence (SCP) for setting
the base paragraph.

That being said, not being able to remember FSI at the beginning of a
string is indeed a significant issue, we agree on this. We just need
to figure out how to alter the emulation behavior to remember them,
which I find the next big step to address in the specification.


cheers,
egmont

Re: Proposal for BiDi in terminal emulators

2019-02-01 Thread Egmont Koblinger via Unicode

Hi Eli,

> Arabic presentation forms are more like an exception than a rule, I
> hope you understand this by now.  Most languages/scripts don't have
> such forms, and even for Arabic they cover only a part of what needs
> to be done to present correctly shaped text.  Complex script shaping
> is much more than just substituting some glyphs with others, it
> requires an intimate knowledge of the font being used and its
> capabilities, and the ability to control how various glyphs of a
> grapheme cluster are placed relative to one another, something that an
> application running on a text terminal cannot do.
>
> So I suggest that you don't consider Arabic presentation forms a
> representative of the direction in which terminal emulators supporting
> such scripts should evolve.

Thanks a lot for this information!

I now understand that presentation forms isn't an ideal possible
approach, and the recommendation should be improved here.

Until it happens, I'm uncertain whether using presentation form
characters is a decent low hanging fruit that significantly improves
the readability in some situations (e.g. "good enough" in some sense
for Arabic), or is a dead end we shouldn't propagate.

I still do not agree however that the entire responsibility can be
shifted to the emulator. There are certain important bits of
information that are only available to the application, and not the
emulator – as with many other aspects, such as reordering,
copy-pasting, searching in the data in BiDi-aware text editors using
the terminal's explicit mode, which are all pushed to the application
because the emulator cannot do them correctly.

I believe we should further study the situation, e.g. see whether
ECMA-48's SAPV (8.3.18) parameters 5..8 (to explicitly specify whether
to use isolated/initial/medial/final form for each character) are
flexible enough to convey all this information, or perhaps a new, more
powerful means should be crafted. At this point I lack sufficient
knowledge to fix the design, I'd need to spend a lot of time studying
the situation and/or working together with you guys, if you're up for
it.


Thanks a lot,
egmont

Re: Proposal for BiDi in terminal emulators

2019-02-01 Thread Khaled Hosny via Unicode

On Thu, Jan 31, 2019 at 11:17:19PM +, Richard Wordingham via Unicode wrote:
> On Thu, 31 Jan 2019 12:46:48 +0100
> Egmont Koblinger  wrote:
> 
> No.  How many cells do CJK ideographs occupy?  We've had a strong hint
> that a medial BEH should occupy one cell, while an isolated BEH should
> occupy two.

Monospaced Arabic fonts (there are not that many of them) are designed
so that all forms occupy just one cell (most even including the mandatory
lam-alef ligatures), unlike CJK fonts.

I can imagine the terminal restricting itself to monospaced fonts,
disabling the “liga” feature just in case, and expecting the font to behave well.
Any other magic is likely to fail.

Regards,
Khaled

Re: Proposal for BiDi in terminal emulators

2019-01-31 Thread Eli Zaretskii via Unicode

> Date: Thu, 31 Jan 2019 23:17:19 +
> From: Richard Wordingham via Unicode 
> 
> Emacs needs a lot of help - I can't write a generic Tai Tham
> OpenType .flt file :-(

Which is why Emacs is migrating towards HarfBuzz.

Re: Proposal for BiDi in terminal emulators

2019-01-31 Thread Richard Wordingham via Unicode

On Thu, 31 Jan 2019 12:46:48 +0100
Egmont Koblinger  wrote:

> Hi Richard,
> 
> > Basic Arabic shaping, at the level of a typewriter, is
> > straightforward enough to leave to a terminal emulator, as Eli has
> > suggested.  
> 
> What is "basic" Arabic shaping exactly?

Just using initial, medial and final forms, with no vertical stacking.
In terms of glyphs, none of the glyphs of the presentation forms with
'LIGATURE' in the name would be used.
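To make "initial, medial and final forms" concrete, here is a rough sketch
that selects among the single-letter forms in the Arabic Presentation
Forms-B block, using the compatibility decompositions in Python's
unicodedata. It is an illustration only, with invented names; it ignores
combining marks, ligatures and letters outside that block, and it infers
joining behaviour from which forms exist rather than from the real
Joining_Type property.

```python
import unicodedata

FORM_TAGS = ("<isolated>", "<final>", "<initial>", "<medial>")

def build_form_table():
    """Map base letter -> {form name: presentation form character}."""
    table = {}
    for cp in range(0xFE70, 0xFF00):
        parts = unicodedata.decomposition(chr(cp)).split()
        # Keep single-letter forms such as "<initial> 0628"; this skips
        # the lam-alef ligatures and the spacing diacritic forms.
        if len(parts) == 2 and parts[0] in FORM_TAGS:
            base = chr(int(parts[1], 16))
            table.setdefault(base, {})[parts[0].strip("<>")] = chr(cp)
    return table

FORMS = build_form_table()

def shape_basic(text):
    """Pick a positional form for each letter from its direct neighbours."""
    out = []
    for i, ch in enumerate(text):
        forms = FORMS.get(ch)
        if forms is None:
            out.append(ch)
            continue
        prev_forms = FORMS.get(text[i - 1], {}) if i > 0 else {}
        next_forms = FORMS.get(text[i + 1], {}) if i + 1 < len(text) else {}
        # A letter joins forward only if it has an initial form, and
        # joins backward only if it has a final form.
        joins_prev = "initial" in prev_forms and "final" in forms
        joins_next = "initial" in forms and "final" in next_forms
        if joins_prev and joins_next and "medial" in forms:
            form = "medial"
        elif joins_next:
            form = "initial"
        elif joins_prev:
            form = "final"
        else:
            form = "isolated"
        out.append(forms.get(form, ch))
    return "".join(out)

shaped = shape_basic("\u0628\u0628\u0628")  # BEH BEH BEH, logical order
```

Applying NFKC normalization to the result gives back the original letters,
which is one reason such output is unsuitable for anything written to a
file.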

> I can see problems with leaving it to a terminal. It's not aware of
> the neighboring character if the string is cropped.

Cropped why?  If the problem is the truncation of lines, one can simply
store the next character.

> It's not able to
> separate different UI elements that happen to be adjacent in the
> terminal, separated by different background color or such.

ZWJ and ZWNJ can handle that.

> On the other hand, let's reverse the question:
> 
> "Basic Arabic shaping, at the level of a typewriter, is
> straightforward enough to be implemented in the application, using
> presentation form characters, as I suggest". Could you please point
> out the problems with this statement?

Apart from using presentation form characters in raw text being a sin?

If a general text manipulating application, e.g. cat, grep or awk, is
writing to a file, it should not convert normal Arabic characters to
presentation forms.  You are now asking a general application to
determine whether it is writing to a terminal or not, and alter its
output if it is writing to a terminal.  If the terminal window is
actually an emacs text buffer, I would not want such output to be
converted.  It is entirely natural to convert an emacs text buffer to
a file. 

> > I believe combining marks present issues even in implicit modes.  In
> > implicit mode, one cannot simply delegate the task to normal text
> > rendering, for one has to allocate text to cells.  There are a
> > number of complications that spring to mind:
> >
> > 1) Some characters decompose to two characters that may otherwise
> > lay claim to their own cells:
> >
> > U+06D3 ARABIC LETTER YEH BARREE WITH HAMZA ABOVE decomposes to
> > <06D2,  
> > 0654>.  Do you intend that your scheme be usable by
> > Unicode-compliant processes?  
> 
> Decompose during which step? During shaping?
> 
> Or do you mean they are NFC-NFD counterparts of each other?

The latter.

> > 4) Indic conjuncts.
> > (i) There are some conjuncts, such as Devanagari K.SSA, where a
> > display as ,  is simply unacceptable.  In some
> > closely related scripts, this conjunct has the status of a
> > character.  
> 
> We (in GNOME Terminal / VTE) do have an open bug about Devanagari
> spacing marks (currently they don't show up properly), plus Virama and
> friends. I'd like to address the essentials along with the BiDi
> implementation; although here we should discuss the design and not a
> particular implementation thereof :)
> 
> In case you're interested, at
> https://bugzilla.gnome.org/show_bug.cgi?id=584160 comments 45-48, 95
> and perhaps a few others comments I wondered whether certain joining
> operations should be done on the emulation layer or the display layer.
> The answer is not yet clear. We can't fix suddenly everything, but
> it's nice to move forward step by step. It has also been proposed that we
> use HarfBuzz, but it's unclear to me at this point how the grid
> alignment could be preserved in the meantime.

Thanks for the link.

There are two different beasties.  There are text windows into which
the user and the application communicate using text, and this text
tends to be rendered properly, as one might aim to do with HarfBuzz,
and as an Emacs text buffer running the shell tries to do.  (Emacs
needs a lot of help - I can't write a generic Tai Tham OpenType .flt
file :-( )  In my opinion, these are highly appropriate for applications
like diff, grep and cat.  Do we have a good name for them?  They are,
perhaps, 'teletype emulators'.

> "simply unacceptable" – I'm not familiar with those languages,
> cultures and so on, but I'd be hesitant to go as far as calling
> anything "unacceptable". E.g. there's a physical typewriter in our
> family, as far as I remember it has no digits 1 or 0 (use the letters
> lowercase L and anycase O instead), it doesn't contain all the
> accented letters of my mother tongue so sometimes a similarly looking
> one has to be used. In today's computer world, I'd say such
> limitations are "unacceptable", but at that time this was what we had
> to live with.
> 
> Terminal emulators, due to their strict character grid nature and
> their legacy behavior of many decades, are a platform where a certain
> level of compromise might be necessary for some scripts. I cannot tell
> where to draw the line, cannot tell what is "extremely bad" vs. "not
> nice" vs. "kind of okay but could be better", but we can't do
> everything in a terminal emulator that a graphical app could do. If
> someone wants to have a pixel perfect look, terminal emulators are

Re: Proposal for BiDi in terminal emulators

2019-01-31 Thread Richard Wordingham via Unicode

On Thu, 31 Jan 2019 08:28:41 +
Martin J. Dürst via Unicode  wrote:

> > Basic Arabic shaping, at the level of a typewriter, is
> > straightforward enough to leave to a terminal emulator, as Eli has
> > suggested.  Lam-alif would be trickier - one cell or two?  
> 
> Same for other characters. A medial Beh/Teh/Theh/... (ببب) in any 
> reasonably decent rendering should take quite a bit less space than a 
> Seen or Sheen (سسس). I remember that the multilingual Emacs version 
> mostly written by Ken'ichi Handa (was it called mEmacs or nEmacs or 
> something like that?) had different widths only just for Arabic. In 
> Thunderbird, which is what I'm using here, I get hopelessly 
> stretched/squeezed glyph shapes, which definitely don't look good.

It's a long time since I last knowingly read typewritten Arabic script,
but on reading the description of Haddad's design of the Arabic
typewriter, I see what you mean.  My point is correct, but your point
is another argument for having single- and double-width characters.

Richard.

Re: Proposal for BiDi in terminal emulators

2019-01-31 Thread Doug Ewell via Unicode

Egmont Koblinger wrote:

> "Basic Arabic shaping, at the level of a typewriter, is
> straightforward enough to be implemented in the application, using
> presentation form characters, as I suggest". Could you please point
> out the problems with this statement?

As multiple people have pointed out, Arabic presentation forms don't
cover the whole Arabic script and are not generally recommended for new
applications, though they are not formally deprecated.

If you take a look at the parallel discussion about italics in plain
text, you will see a corollary in the use of Mathematical Alphanumeric
Symbols: they look tempting and are (usually) easy to render, but among
other things, they only cover [A-Za-zıȷΑ-Ωα-ω] and thus miss much
of the text that may need to be italicized.

--
Doug Ewell | Thornton, CO, US | ewellic.org

Re: Proposal for BiDi in terminal emulators

2019-01-31 Thread Ken Whistler via Unicode




On 1/31/2019 1:41 AM, Egmont Koblinger via Unicode wrote:

> I mean, for
> example we can introduce control characters that specify the language.


That is a complete non-starter for the Unicode Standard. And if the 
terminal implementation introduces such as one-off hacks, they will fail 
completely for interoperability.


https://en.wikipedia.org/wiki/IETF_language_tag

That belongs to the markup level, not to the plain text stream.

--Ken

Re: Proposal for BiDi in terminal emulators

2019-01-31 Thread Eli Zaretskii via Unicode

> Date: Thu, 31 Jan 2019 10:58:54 +0100
> Cc: unicode@unicode.org
> From: Egmont Koblinger via Unicode 
> 
> Yes, I do argue that emacs will need to print a new escape sequence.
> Which is much-much-much-much-much better than having to tell users to
> go into the settings of their macOS Terminal / Konsole /
> gnome-terminal etc. and disable BiDi there, isn't it?

I'm not sure I agree.  Most users can disable bidi reordering of the
terminal once and for all.  They don't need it.

If terminals supported some control sequence to turn on and off the
reordering, it might be a useful feature to support such sequences.
But IME just querying the emulator whether it supports that or not is
a hassle, and generally slows down the application startup.  So it's a
mixed blessing.

> > On the other hand, all that the program can output is a sequence of Unicode
> > codepoints.  These don't include shaping information
> 
> With "presentation form" characters, yes, they can, they do including
> shaping information.

Let's please stop talking about presentation forms, they solve only a
small part of the shaping problem.

> > and it's
> > the terminal who's equipped with _most_ of the needed data
> 
> Why? It's the app that knows the context characters, it's the app that
> knows the language.

But only the emulator knows which fonts it uses, and only the emulator
can access the information about the font, like what OTF features it
supports, what glyphs it has, etc.

Re: Proposal for BiDi in terminal emulators

2019-01-31 Thread Eli Zaretskii via Unicode

> From: Egmont Koblinger 
> Date: Thu, 31 Jan 2019 10:41:02 +0100
> Cc: Frédéric Grosshans , 
>   unicode@unicode.org
> 
> > Personally, I think we should simply assume that complex script
> > shaping is left to the terminal, and if the terminal cannot do that,
> > then that's a restriction of working on a text terminal.
> 
> I cannot read any of the Arabic, Syriac etc. scripts, but I did lots
> of experimenting with picking random Arabic words, displaying in the
> terminal their unshaped as well as shaped (using presentation form
> characters) variants, and compared them to pango-view's (harfbuzz's)
> rendering.
> 
> To my eyes the version I got in the terminal with the presentation
> form characters matched the "expected" (pango-view) rendering
> extremely closely.

I suggest that you show the result to someone who does read Arabic.
Small changes can be very unpleasant to the eyes of an Arabic reader.

As for Arabic presentation forms, I already explained why they cannot
be considered a solution that is anywhere near the complete one.

> > OTOH a terminal emulator who wants to perform shaping needs
> > information from the application
> 
> And the presentation form characters are pretty much exactly that
> information, aren't they (for Arabic)?

Much more is needed for correct shaping.

> Instead of saying that it's not possible, could we perhaps try to
> solve it, or at least substantially improve the situation? I mean, for
> example we can introduce control characters that specify the language.
> We can introduce a flag that tells the terminal whether to do shaping
> or not. There are probably plenty of more ideas to be thrown in for
> discussion and improvement.

You could do that, but it will require a lot of non-trivial processing
from the applications.  Text-mode applications don't want any complex
tinkering, they want just to write their text and be done.  The more
overhead you add to that simple task, the less probable it is that
applications will support such a terminal.

Re: Proposal for BiDi in terminal emulators

2019-01-31 Thread Eli Zaretskii via Unicode

> From: Egmont Koblinger 
> Date: Thu, 31 Jan 2019 10:28:27 +0100
> Cc: Adam Borowski , unicode@unicode.org
> 
> On Wed, Jan 30, 2019 at 5:10 PM Eli Zaretskii  wrote:
> 
> > I think the application could use TAB characters to get to the next
> > cell, then simplistic reordering would also work.
> 
> TAB handling is extremely complicated, because in terminal emulation
> TAB is not a character, TAB is a control instruction (like escape
> sequences) that moves the cursor (and jumps through the existing
> content, if any, without erasing it).

The reordering happens before TABs are converted to cursor motion,
does it not?  If so, their effect on reordering, by virtue of the TAB
being Segment Separator for the UBA purposes, could happen
nonetheless.  Right?

Re: Proposal for BiDi in terminal emulators

2019-01-31 Thread Eli Zaretskii via Unicode

> From: Egmont Koblinger 
> Date: Thu, 31 Jan 2019 10:21:52 +0100
> Cc: Adam Borowski , unicode@unicode.org
> 
> > Does anyone know of a terminal emulator which supports isolates?
> 
> GNOME Terminal's (VTE's) current work-in-progress implementation does
> remember BiDi control characters just like it remembers combining
> accents, that is, tied to the preceding letter's cell. It uses FriBidi
> 1.0 for the BiDi work, so yes, it supports Unicode 6.3's isolates.

So we will some day have one such terminal emulator.  That's good, but
a text-mode application that needs to support bidi cannot rely on its
users all having access to that single terminal.

> There's one significant issue, though. Because we currently just
> misuse our existing infrastructure of combining accents for the BiDi
> controls, BiDi controls at the very beginning of a paragraph are
> dropped. Addressing this issue would need core changes to the terminal
> emulation behavior, such as introducing in-between-cells storage, or
> zero-width special characters belonging to a cell _before_ the cell's
> actual letter, or something like this. I outline one idea in my
> specification, but it's subject to discussion to finalize it.

This is indeed a significant issue, because it means applications
cannot force the terminal use a certain non-default base paragraph
direction.

Re: Proposal for BiDi in terminal emulators

2019-01-31 Thread Eli Zaretskii via Unicode

> From: Egmont Koblinger 
> Date: Thu, 31 Jan 2019 10:11:22 +0100
> Cc: unicode@unicode.org
> 
> > It doesn't do _any_ shaping.  Complex script shaping is left to the
> > terminal, because it's impossible to do shaping in any reasonable way
> > [...]
> 
> Partially, you are right. On the other hand, as far as I know, shaping
> should take into account the neighboring glyphs even if those are not
> visible (e.g. overflow from the viewport), and the terminal is unaware
> of what those glyphs are. This is an area that "presentation form"
> characters can address for Arabic – although as it was pointed out,
> not for Syriac and some others.

Arabic presentation forms are more like an exception than a rule, I
hope you understand this by now.  Most languages/scripts don't have
such forms, and even for Arabic they cover only a part of what needs
to be done to present correctly shaped text.  Complex script shaping
is much more than just substituting some glyphs with others, it
requires an intimate knowledge of the font being used and its
capabilities, and the ability to control how various glyphs of a
grapheme cluster are placed relative to one another, something that an
application running on a text terminal cannot do.

So I suggest that you don't consider Arabic presentation forms a
representative of the direction in which terminal emulators supporting
such scripts should evolve.

Re: Proposal for BiDi in terminal emulators

2019-01-31 Thread Frédéric Grosshans via Unicode


Le 31/01/2019 à 10:41, Egmont Koblinger a écrit :

Hi,


Personally, I think we should simply assume that complex script
shaping is left to the terminal, and if the terminal cannot do that,
then that's a restriction of working on a text terminal.

I cannot read any of the Arabic, Syriac etc. scripts, but I did lots
of experimenting with picking random Arabic words, displaying in the
terminal their unshaped as well as shaped (using presentation form
characters) variants, and compared them to pango-view's (harfbuzz's)
rendering.

To my eyes the version I got in the terminal with the presentation
form characters matched the "expected" (pango-view) rendering
extremely closely. Of course there are still some tradeoffs due to fixed
width cells (just as in English, arguably an "i" and "w" having the
same width doesn't look as nice as with proportional fonts). Meanwhile,
the unshaped characters look vastly different.


OTOH a terminal emulator who wants to perform shaping needs
information from the application

And the presentation form characters are pretty much exactly that
information, aren't they (for Arabic)?


There's nothing you can do here [...] there's no way for the application to 
provide

Instead of saying that it's not possible, could we perhaps try to
solve it, or at least substantially improve the situation? I mean, for
example we can introduce control characters that specify the language.
We can introduce a flag that tells the terminal whether to do shaping
or not. There are probably plenty of more ideas to be thrown in for
discussion and improvement.


cheers,
egmont

Re: Proposal for BiDi in terminal emulators

2019-01-31 Thread Egmont Koblinger via Unicode

Hi Richard,

> Basic Arabic shaping, at the level of a typewriter, is straightforward
> enough to leave to a terminal emulator, as Eli has suggested.

What is "basic" Arabic shaping exactly?

I can see problems with leaving it to a terminal. It's not aware of
the neighboring character if the string is cropped. It's not able to
separate different UI elements that happen to be adjacent in the
terminal, separated by different background color or such.

On the other hand, let's reverse the question:

"Basic Arabic shaping, at the level of a typewriter, is
straightforward enough to be implemented in the application, using
presentation form characters, as I suggest". Could you please point
out the problems with this statement?

> I believe combining marks present issues even in implicit modes.  In
> implicit mode, one cannot simply delegate the task to normal text
> rendering, for one has to allocate text to cells.  There are a number
> of complications that spring to mind:
>
> 1) Some characters decompose to two characters that may otherwise lay
> claim to their own cells:
>
> U+06D3 ARABIC LETTER YEH BARREE WITH HAMZA ABOVE decomposes to <06D2,
> 0654>.  Do you intend that your scheme be usable by Unicode-compliant
> processes?

Decompose during which step? During shaping?

Or do you mean they are NFC-NFD counterparts of each other?

Most terminal emulators are able to handle combining accents, and of
course implicit mode would take them into account when rearranging the
letters. Terminal emulators don't do explicit (de)composing, a.k.a.
NFC->NFD or NFD->NFC conversion (at least I'm not aware of any that
do).
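
A minimal sketch of the NFC/NFD point, using Python's standard
unicodedata module. The cell_count helper is a hypothetical toy rule
(attach every character with a nonzero combining class to the previous
cell), not the logic of any particular terminal emulator:

```python
import unicodedata

composed = "\u06D3"  # ARABIC LETTER YEH BARREE WITH HAMZA ABOVE
decomposed = unicodedata.normalize("NFD", composed)


def cell_count(text):
    """Toy cell allocation: a character with a nonzero canonical
    combining class shares the cell of the preceding character."""
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)


# Show the code points of the NFD form and the cell count of both forms.
print([f"U+{ord(ch):04X}" for ch in decomposed])
print(cell_count(composed), cell_count(decomposed))
```

Under this toy rule both spellings occupy a single cell, so a terminal
that handles combining marks this way does not need to normalize its
input.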

> 4) Indic conjuncts.
> (i) There are some conjuncts, such as Devanagari K.SSA, where a
> display as ,  is simply unacceptable.  In some
> closely related scripts, this conjunct has the status of a character.

We (in GNOME Terminal / VTE) do have an open bug about Devanagari
spacing marks (currently they don't show up properly), plus Virama and
friends. I'd like to address the essentials along with the BiDi
implementation; although here we should discuss the design and not a
particular implementation thereof :)

In case you're interested, at
https://bugzilla.gnome.org/show_bug.cgi?id=584160 comments 45-48, 95
and perhaps a few other comments, I wondered whether certain joining
operations should be done on the emulation layer or the display layer.
The answer is not yet clear. We can't suddenly fix everything, but
it's nice to move forward step by step. It has also been proposed that
we use HarfBuzz, but it's unclear to me at this point how the grid
alignment could be preserved at the same time.

"simply unacceptable" – I'm not familiar with those languages,
cultures and so on, but I'd be hesitant to go as far as calling
anything "unacceptable". E.g. there's a physical typewriter in our
family, as far as I remember it has no digits 1 or 0 (use the letters
lowercase L and any-case O instead), and it doesn't contain all the
accented letters of my mother tongue, so sometimes a similar-looking
one has to be used. In today's computer world, I'd say such
limitations are "unacceptable", but at that time this was what we had
to live with.

Terminal emulators, due to their strict character grid nature and
their legacy behavior of many decades, are a platform where a certain
level of compromise might be necessary for some scripts. I cannot tell
where to draw the line, cannot tell what is "extremely bad" vs. "not
nice" vs. "kind of okay but could be better", but we can't do
everything in a terminal emulator that a graphical app could do. If
someone wants to have a pixel perfect look, terminal emulators are not
for them. Maybe looking at typewriters of those scripts could be a
good starting point. Anyway, we've drifted quite far away.


What I've already implemented in VTE (in a work-in-progress branch),
and to my eyes looks quite nice, is Arabic shaping using presentation
form characters as done by FriBidi (in implicit mode only). According
to the API of this library, this shaping process keeps a 1:1 mapping
between the original and shaped letters (at least the number of
Unicode codepoints – I haven't double checked their terminal width,
but I really hope they don't mess with us here). That is, I don't have
to deal with a character cell splitting into two, or two character
cells joining into one during shaping. Does this sound okay so far?
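
A rough sketch of what 1:1 shaping with presentation form characters
can look like. This is not FriBidi's algorithm or API. It is a toy that
derives the forms from the compatibility decompositions in Python's
unicodedata module, and it assumes that a letter joins forward if it has
an <initial> form and joins backward if it has a <final> form. It
ignores combining marks, tatweel, ZWJ/ZWNJ and the lam-alef ligature.
The names build_forms and shape are made up for illustration:

```python
import unicodedata

TAGS = ("<isolated>", "<final>", "<initial>", "<medial>")


def build_forms():
    """Map base letter -> {form name: presentation form character},
    keeping only one-to-one compatibility decompositions."""
    forms = {}
    for cp in range(0xFB50, 0xFF00):  # Arabic Presentation Forms-A and -B
        parts = unicodedata.decomposition(chr(cp)).split()
        if len(parts) == 2 and parts[0] in TAGS:
            base = chr(int(parts[1], 16))
            forms.setdefault(base, {}).setdefault(parts[0].strip("<>"), chr(cp))
    return forms


FORMS = build_forms()


def joins_forward(ch):
    # Assumption: having an initial form means the letter connects to its successor.
    return "initial" in FORMS.get(ch, {})


def joins_backward(ch):
    # Assumption: having a final form means the letter connects to its predecessor.
    return "final" in FORMS.get(ch, {})


def shape(logical):
    """Replace each letter by one presentation form, in logical order."""
    out = []
    for i, ch in enumerate(logical):
        prev = logical[i - 1] if i > 0 else ""
        nxt = logical[i + 1] if i + 1 < len(logical) else ""
        to_prev = joins_backward(ch) and joins_forward(prev)
        to_next = joins_forward(ch) and joins_backward(nxt)
        if to_prev and to_next:
            form = "medial"
        elif to_prev:
            form = "final"
        elif to_next:
            form = "initial"
        else:
            form = "isolated"
        # Characters without a matching form are passed through unchanged.
        out.append(FORMS.get(ch, {}).get(form, ch))
    return "".join(out)


word = "\u0633\u0644\u0627\u0645"  # seen, lam, alef, meem
shaped = shape(word)
print([f"U+{ord(ch):04X}" for ch in shaped])
```

Because every input code point maps to exactly one output code point,
the number of cells does not change, which is the property described
above.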


cheers,
egmont

Re: Proposal for BiDi in terminal emulators

2019-01-31 Thread Egmont Koblinger via Unicode

On Thu, Jan 31, 2019 at 10:05 AM Richard Wordingham via Unicode wrote:

> > How will "ls -l" possibly work?  This is an example of the "table"
> > layout you were already discussing.
>
> I think the answer is that it will use the same trickery as with a
> default setting for the --color argument.  Colour codes are emitted
> only when the output is a terminal.  Presumably the same would go for
> Bidi controls.

Exactly, that's what I have in mind in the long run. If coreutils
folks like the idea, "ls" could have a new option
--bidi=never/auto/always. With BiDi mode, it would enclose each of the
logical segments of strings that potentially contain RTL text
(filenames, dates etc.) separately inside an FSI...PDI block. That way
its output would look as desired (over the terminal's new default
"implicit" mode), since the terminal would take care of BiDi-ing each
FSI...PDI block.
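
A small sketch of the idea, written in Python rather than as a
coreutils patch. The --bidi option does not exist in "ls"; format_row
and its parameters are hypothetical names. Only the two code points
(U+2068 FIRST STRONG ISOLATE, U+2069 POP DIRECTIONAL ISOLATE) are taken
from Unicode:

```python
FSI = "\u2068"  # FIRST STRONG ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE


def isolate(field):
    """Wrap one logical segment so the terminal reorders it on its own."""
    return f"{FSI}{field}{PDI}"


def format_row(fields, bidi="auto", is_tty=False):
    """Join fields with spaces, isolating each one when BiDi output is on.

    bidi mirrors the never/auto/always convention used by --color.
    """
    if bidi not in ("never", "auto", "always"):
        raise ValueError(f"unknown bidi mode: {bidi!r}")
    use_isolates = bidi == "always" or (bidi == "auto" and is_tty)
    return " ".join(isolate(f) if use_isolates else f for f in fields)


row = ["-rw-r--r--", "Jan 31", "\u05E9\u05DC\u05D5\u05DD.txt"]
piped = format_row(row, bidi="auto", is_tty=False)
on_terminal = format_row(row, bidi="auto", is_tty=True)
```

A real caller would pass something like sys.stdout.isatty() for is_tty,
so piped output stays free of the extra control characters.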


cheers,
egmont

Re: Proposal for BiDi in terminal emulators

2019-01-31 Thread Egmont Koblinger via Unicode

Hi,

> > And if you argue "so make emacs print your
> > new code to disable formatting", so do thousands of other programs that are
> > less sophisticated than emacs.
>
> Yes, I do argue that emacs will need to print a new escape sequence.
> Which is much-much-much-much-much better than having to tell users to
> go into the settings of their macOS Terminal / Konsole /
> gnome-terminal etc. and disable BiDi there, isn't it?

Let me phrase it slightly differently. Emacs will not "need" to print
a new escape sequence, but will have the possibility to do so.

VTE is pretty certainly going to switch its default behavior to what
Konsole, PuTTY, Mlterm, macOS Terminal do now: to perform BiDi on its
contents. This mode is not suitable for Emacs or for any BiDi-aware
text editor.

Similarly to these terminal emulators, GNOME Terminal (and hopefully
other VTE-based frontends) will also most likely have a user setting
to force disable BiDi.

But as opposed to the aforementioned terminals, VTE will also turn off
BiDi upon a designated escape sequence.

VTE is the terminal widget behind several emulator apps, such as GNOME
Terminal, Xfce Terminal, Tilix, Terminator, Guake... I don't have
metrics, but according to various user polls I have the feeling that
VTE's usage share among Linux users is pretty significant, somewhere
in the ballpark of 50%.

Of course Emacs, or any other text editor, can still point its users
to the terminal's setting to disable BiDi. And then if the user also
wishes to have BiDi for "cat", they'll have to keep toggling it back
and forth. Or Emacs can emit the new escape sequence and then it will
be fully automatic.
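
A sketch of the "fully automatic" variant from an application's point
of view. The actual escape sequences are not given in this message, so
the strings below are placeholders and explicit_bidi is a hypothetical
helper:

```python
import contextlib
import io


@contextlib.contextmanager
def explicit_bidi(out, disable_seq, enable_seq):
    """Turn the terminal's BiDi off for the lifetime of a full-screen app
    and turn it back on afterwards, even if the app exits with an error."""
    out.write(disable_seq)
    out.flush()
    try:
        yield out
    finally:
        out.write(enable_seq)
        out.flush()


# Placeholder strings stand in for the real (unspecified) escape sequences.
buf = io.StringIO()
with explicit_bidi(buf, "<BIDI-OFF>", "<BIDI-ON>") as term:
    term.write("editor output")
print(buf.getvalue())
```

The user's setting is never touched, so "cat" run before or after the
editor still gets the terminal's default BiDi behavior.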

Which one puts less support burden on Emacs's developers and
supporters? Which one is better for the users? I think the answer to
these two questions is the same, and you surely know which answer I'm
thinking of.

According to this specification, nothing is going to be "worse" than
it already is in those few aforementioned terminal emulators. The new
default behavior will be the same as their behavior. We'll just
further extend it with the possibility of switching back to the old
mode without annoying the user.

I hope this clarifies a lot.


cheers,
egmont

Re: Proposal for BiDi in terminal emulators

2019-01-31 Thread Egmont Koblinger via Unicode

Hi,

> Arabic terminals and terminal emulators existed at the time of Unicode 1.0.

I haven't found any mention of them, let alone any documentation about them.

> If you are trying to emulate those services, for example so that older 
> software can run, you would need to look at how these programs expected to be 
> fed their data.

My goal is not to have that ancient software run. My goal is to look
into the future. Address the requests often seen in current terminal
emulators' bug trackers. Stop the utterly unacceptable user experience
of current self-proclaimed BiDi-aware terminals where in order to run
Emacs you need to fiddle with the terminal's settings. Show that BiDi
in terminals is a much more complex story than just shuffling around
the characters, thus stopping new emulators from taking this broken
route which causes about as much damage as good. Create a platform on
top of which modern BiDi-aware experience can be created, to make both
"cat" and "emacs" work properly out of the box for BiDi users.

> I see little reason to reinvent things here, because we are talking about 
> emulating legacy hardware. Or are we not?

As per the above, no, not really. I'm not aware of any hardware that
supported BiDi, was there any? I look at terminal emulators as
extremely powerful tools for getting all kinds of work done. They are
continuously being improved; nowadays many terminal emulators contain
plenty of features that weren't there in any hardware terminal. I'm
looking to seamlessly extend the terminal emulator experience to the
RTL / BiDi world.

> It's conceivable, that with modern fonts, one can show some characters that 
> could not be supported on the actual legacy hardware, because that was 
> limited by available character memory and available pre-Unicode character 
> sets. As long as the new characters otherwise fit the paradigm (character per 
> cell) they can be supported without other changes in the protocol beyond 
> change in character set.

Which protocol? The protocol of non-BiDi-aware terminals that lay out
everything from left to right, so the output of "echo", "cat" etc. is
reversed; or the protocol of self-proclaimed BiDi-aware terminals
where it's literally impossible to create a proper BiDi-aware text
editor?

My work focuses on proving that both of these modes are needed, and
how the necessary mode switches could happen automatically behind the
scenes.

> However, I would not expect an emulator to accept data in NFD for example.

Many emulators do.


cheers,
egmont

Re: Proposal for BiDi in terminal emulators

2019-01-31 Thread Egmont Koblinger via Unicode

Hi,

On Wed, Jan 30, 2019 at 5:31 PM Adam Borowski wrote:

> The program (emacs in this case) can do arbitrary reordering of characters
> on the grid, it also has lots of information the terminal doesn't.  For
> example, what are you going to do when there's a line longer than what fits
> on the screen?  Emacs will cut and hide part of it; any attempts to reorder
> that paragraph by the terminal are outright broken as you don't _have_ the
> paragraph.  Same for a popup window on the middle of the screen partially
> obscuring some text underneath.

This is absolutely correct so far.
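
A toy illustration of why reordering a clipped line goes wrong. It uses
the common documentation convention that uppercase letters stand for
RTL characters, and toy_reorder is a deliberately simplistic stand-in
(reverse each uppercase run in an LTR paragraph), not the Unicode BiDi
algorithm:

```python
import re


def toy_reorder(logical):
    """Reverse each maximal run of uppercase letters (pretend-RTL text)."""
    return re.sub(r"[A-Z]+", lambda m: m.group(0)[::-1], logical)


line = "abc DEFGHIJ"  # logical order; DEFGHIJ plays the role of an RTL word
width = 7             # visible columns

# The application knows the whole line: reorder first, then clip.
app_view = toy_reorder(line)[:width]

# The terminal only receives the clipped text: clip first, then reorder.
terminal_view = toy_reorder(line[:width])

print(repr(app_view), repr(terminal_view))
```

The two results differ because the terminal never saw the characters
that were cut off, which is the situation described in the quoted
paragraph.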

> And if you argue "so make emacs print your
> new code to disable formatting", so do thousands of other programs that are
> less sophisticated than emacs.

Yes, I do argue that emacs will need to print a new escape sequence.
Which is much-much-much-much-much better than having to tell users to
go into the settings of their macOS Terminal / Konsole /
gnome-terminal etc. and disable BiDi there, isn't it?

Could you please give me a brief idea about those "thousands of other
programs" that will need to be adjusted? What other apps can do BiDi?
Not even Vim/NeoVim can do it.

If an app doesn't support BiDi, it's broken anyway when encountering
RTL text. It'll still be broken, just broken differently. Are these
the programs you meant by "thousands"?

For ncurses apps there's also a workaround that you could apply:
create a terminfo where the ti/te entries not only switch to/from the
alternate screen but also disable/enable BiDi. In that case all these
thousand ones will be "fixed" (that is: broken in the "old" way rather
than broken in the "new" way).

On the other hand, what you absolutely can *not* do automatically by
emitting escape sequences at the right times, is to enclose the output
of much lighter utilities like "echo", "cat", "grep", "head" and so on
with any kind of BiDi controls.

> On the other hand, all that the program can output is a sequence of Unicode
> codepoints.  These don't include shaping information

With "presentation form" characters, yes, they can: those do include
shaping information.

> and are not supposed
> to.  The shaping is explicitly meant to be done by the terminal,

Why?

> and it's
> the terminal who's equipped with _most_ of the needed data

Why? It's the app that knows the context characters, it's the app that
knows the language.

What is it that the terminal knows but the app doesn't, although it
should? Or what is it that the terminal doesn't know if presentation
form characters are used?

What is it that the app knows but cannot pass to the terminal?
Shouldn't we then extend the protocol so that it can pass these, too?

e.

Re: Proposal for BiDi in terminal emulators

2019-01-31 Thread Egmont Koblinger via Unicode

Hi,

> Personally, I think we should simply assume that complex script
> shaping is left to the terminal, and if the terminal cannot do that,
> then that's a restriction of working on a text terminal.

I cannot read any of the Arabic, Syriac etc. scripts, but I did lots
of experimenting with picking random Arabic words, displaying in the
terminal their unshaped as well as shaped (using presentation form
characters) variants, and compared them to pango-view's (harfbuzz's)
rendering.

To my eyes the version I got in the terminal with the presentation
form characters matched the "expected" (pango-view) rendering
extremely closely. Of course there are still some tradeoffs due to
fixed-width cells (just as in English, arguably an "i" and a "w" having
the same width doesn't look as nice as with proportional fonts).
Meanwhile, the unshaped characters look vastly different.

> OTOH a terminal emulator who wants to perform shaping needs
> information from the application

And the presentation form characters are pretty much exactly that
information, aren't they (for Arabic)?

> There's nothing you can do here [...] there's no way for the application to 
> provide

Instead of saying that it's not possible, could we perhaps try to
solve it, or at least substantially improve the situation? I mean, for
example we can introduce control characters that specify the language.
We can introduce a flag that tells the terminal whether to do shaping
or not. There are probably plenty more ideas to be thrown in for
discussion and improvement.


cheers,
egmont

Re: Proposal for BiDi in terminal emulators

2019-01-31 Thread Egmont Koblinger via Unicode

Hi Eli,

On Wed, Jan 30, 2019 at 5:10 PM Eli Zaretskii wrote:

> I think the application could use TAB characters to get to the next
> cell, then simplistic reordering would also work.

TAB handling is extremely complicated, because in terminal emulation
TAB is not a character, TAB is a control instruction (like escape
sequences) that moves the cursor (and jumps through the existing
content, if any, without erasing it). Some terminal emulators perform
some magic to remember TABs in certain circumstances, but they cannot
always do so.

There are plenty of other problems, e.g. how they are handled at the
end of line (no, they don't wrap to the next line), how their
positions are user-configurable and not necessarily at every 8th
column, etc. I'm not going into these details now, if you don't mind;
it's just not a feasible approach.
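
A toy model of a single terminal row that shows the two TAB properties
mentioned above: the cursor jumps over existing content without erasing
it, and it stops at the last column instead of wrapping. The feed
function is hypothetical, and fixed tab stops every tab_width columns
are a simplification (real tab stops are configurable):

```python
def feed(row, text, tab_width=8):
    """Write text into a row of existing cells, treating TAB as a cursor
    movement rather than as a character stored in a cell."""
    cells = list(row)
    last = len(cells) - 1
    col = 0
    for ch in text:
        if ch == "\t":
            # Jump to the next tab stop, but never past the last column.
            col = min((col // tab_width + 1) * tab_width, last)
        else:
            cells[col] = ch
            col = min(col + 1, last)  # toy clamp; no line wrapping modeled
    return "".join(cells)


old_row = "x" * 12
print(feed(old_row, "a\tb"))
print(feed(old_row, "\t\t\tz"))
```

After the first call the cells between "a" and "b" still hold the old
"x" characters, so nothing in the row records that a TAB was ever
there.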

cheers,
egmont

Re: Proposal for BiDi in terminal emulators

2019-01-31 Thread Egmont Koblinger via Unicode

Hi Eli,

> Does anyone know of a terminal emulator which supports isolates?

GNOME Terminal's (VTE's) current work-in-progress implementation does
remember BiDi control characters just like it remembers combining
accents, that is, tied to the preceding letter's cell. It uses FriBidi
1.0 for the BiDi work, so yes, it supports Unicode 6.3's isolates.

There's one significant issue, though. Because we currently just
misuse our existing infrastructure of combining accents for the BiDi
controls, BiDi controls at the very beginning of a paragraph are
dropped. Addressing this issue would need core changes to the terminal
emulation behavior, such as introducing in-between-cells storage, or
zero-width special characters belonging to a cell _before_ the cell's
actual letter, or something like this. I outline one idea in my
specification, but it's subject to discussion to finalize it.
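
A toy model of the storage scheme described above, to show where the
leading control gets lost. This is not VTE's code; store and
is_zero_width are hypothetical, and the list of BiDi controls is the
usual set of marks, embeddings, overrides and isolates:

```python
import unicodedata

BIDI_CONTROLS = set(
    "\u061C\u200E\u200F"              # ALM, LRM, RLM
    "\u202A\u202B\u202C\u202D\u202E"  # LRE, RLE, PDF, LRO, RLO
    "\u2066\u2067\u2068\u2069"        # LRI, RLI, FSI, PDI
)


def is_zero_width(ch):
    return ch in BIDI_CONTROLS or unicodedata.combining(ch) != 0


def store(text):
    """Attach zero-width characters to the preceding cell, as is done for
    combining accents. With no preceding cell there is nowhere to put them."""
    cells, dropped = [], []
    for ch in text:
        if not is_zero_width(ch):
            cells.append(ch)
        elif cells:
            cells[-1] += ch
        else:
            dropped.append(ch)
    return cells, dropped


cells, dropped = store("\u2068ab\u2069")
print(len(cells), [f"U+{ord(ch):04X}" for ch in dropped])
```

The closing PDI survives because it follows "b", while the opening FSI
at the start of the paragraph has no cell to attach to.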

(There's also a less significant issue: copy-pasting fragments of text
probably doesn't produce the contents that make the most sense wrt.
BiDi controls. I'm not sure what other software does here, though.)

Mintty is also actively working on BiDi support; I believe its author
just recently added support for isolates. It uses its own BiDi
implementation.


cheers,
egmont

Re: Proposal for BiDi in terminal emulators

2019-01-31 Thread Egmont Koblinger via Unicode

Hi Eli,

> It doesn't do _any_ shaping.  Complex script shaping is left to the
> terminal, because it's impossible to do shaping in any reasonable way
> [...]

Partially, you are right. On the other hand, as far as I know, shaping
should take into account the neighboring glyphs even if those are not
visible (e.g. overflow from the viewport), and the terminal is unaware
of what those glyphs are. This is an area that "presentation form"
characters can address for Arabic – although, as was pointed out,
not for Syriac and some others.

I'd say it's subject to further research and improvement to find the
ideal behavior.


cheers,
egmont

Re: Proposal for BiDi in terminal emulators

2019-01-31 Thread Richard Wordingham via Unicode

On Wed, 30 Jan 2019 20:35:36 -0500
"Mark E. Shoulson via Unicode"  wrote:

> On 1/30/19 8:58 AM, Egmont Koblinger via Unicode wrote:
> > There's another side to the entire BiDi story, though. Simple
> > utilities like "echo", "cat", "ls", "grep" and so on, line editing
> > experience of your shell, these kinds. It's absolutely not feasible
> > to add BiDi support to these utilities. Here the only viable
> > approach is to have the terminal emulator do it.  
> 
> How will "ls -l" possibly work?  This is an example of the "table" 
> layout you were already discussing.

I think the answer is that it will use the same trickery as with a
default setting for the --color argument.  Colour codes are emitted
only when the output is a terminal.  Presumably the same would go for
Bidi controls.

> 
> I think us command-line troglodytes just have to deal with not having
> a whole lot of BiDi support.  There's simply no way any terminal
> emulator could possibly know what makes sense and what doesn't for a
> given line of text, coming from some random program.  Your "grep"
> could be grepping from a file with ANY layout, not necessarily one
> conducive to terminal layout, and so on.

So how do editors work now?

To avoid confusion, you will have to work with the terminal set to
having LTR paragraphs (or RTL instead); that is how Notepad works.

Richard.

Re: Proposal for BiDi in terminal emulators

2019-01-31 Thread Martin J . Dürst via Unicode

On 2019/01/31 07:02, Richard Wordingham via Unicode wrote:
> On Wed, 30 Jan 2019 15:33:38 +0100
> Frédéric Grosshans via Unicode  wrote:
> 
>> Le 30/01/2019 à 14:36, Egmont Koblinger via Unicode a écrit :
>>> - It doesn't do Arabic shaping. In my recommendation I'm arguing
>>> that in this mode, where shuffling the characters is the task of
>>> the text editor and not the terminal, so should it be for Arabic
>>> shaping using presentation form characters.
>>
>> I guess Arabic shaping is doable through presentation form
>> characters, because the latter are characters inherited from legacy
>> standards using them in such solutions.
> 
> So long as you don't care about local variants, e.g. U+0763 ARABIC
> LETTER KEHEH WITH THREE DOTS ABOVE.  It has no presentation form
> characters.

Same also for characters used for other languages than Arabic.

> Basic Arabic shaping, at the level of a typewriter, is straightforward
> enough to leave to a terminal emulator, as Eli has suggested.  Lam-alif
> would be trickier - one cell or two?

Same for other characters. A medial Beh/Teh/Theh/... (ببب) in any 
reasonably decent rendering should take quite a bit less space than a 
Seen or Sheen (سسس). I remember that the multilingual Emacs version 
mostly written by Ken'ichi Handa (was it called mEmacs or nEmacs or 
something like that?) had different widths only just for Arabic. In 
Thunderbird, which is what I'm using here, I get hopelessly 
stretched/squeezed glyph shapes, which definitely don't look good.

Regards,   Martin.

Re: Proposal for BiDi in terminal emulators

2019-01-30 Thread Mark E. Shoulson via Unicode


On 1/30/19 8:58 AM, Egmont Koblinger via Unicode wrote:

> There's another side to the entire BiDi story, though. Simple
> utilities like "echo", "cat", "ls", "grep" and so on, line editing
> experience of your shell, these kinds. It's absolutely not feasible to
> add BiDi support to these utilities. Here the only viable approach is
> to have the terminal emulator do it.


How will "ls -l" possibly work?  This is an example of the "table" 
layout you were already discussing.


I think us command-line troglodytes just have to deal with not having a 
whole lot of BiDi support.  There's simply no way any terminal emulator 
could possibly know what makes sense and what doesn't for a given line 
of text, coming from some random program.  Your "grep" could be grepping 
from a file with ANY layout, not necessarily one conducive to terminal 
layout, and so on.


~mark

Re: Proposal for BiDi in terminal emulators

2019-01-30 Thread Asmus Freytag via Unicode


  
  
Arabic terminals and terminal emulators existed at the time of Unicode
1.0. If you are trying to emulate those services, for example so that
older software can run, you would need to look at how these programs
expected to be fed their data.

I see little reason to reinvent things here, because we are talking
about emulating legacy hardware. Or are we not?

It's conceivable that, with modern fonts, one can show some characters
that could not be supported on the actual legacy hardware, because that
was limited by available character memory and available pre-Unicode
character sets. As long as the new characters otherwise fit the
paradigm (character per cell), they can be supported without other
changes in the protocol beyond a change in character set.

However, I would not expect an emulator to accept data in NFD, for
example.


A./






Re: Proposal for BiDi in terminal emulators

2019-01-30 Thread Richard Wordingham via Unicode

On Wed, 30 Jan 2019 15:33:38 +0100
Frédéric Grosshans via Unicode  wrote:

> Le 30/01/2019 à 14:36, Egmont Koblinger via Unicode a écrit :
> > - It doesn't do Arabic shaping. In my recommendation I'm arguing
> > that in this mode, where shuffling the characters is the task of
> > the text editor and not the terminal, so should it be for Arabic
> > shaping using presentation form characters.  
> 
> I guess Arabic shaping is doable through presentation form
> characters, because the latter are characters inherited from legacy
> standards using them in such solutions.

So long as you don't care about local variants, e.g. U+0763 ARABIC
LETTER KEHEH WITH THREE DOTS ABOVE.  It has no presentation form
characters.

Basic Arabic shaping, at the level of a typewriter, is straightforward
enough to leave to a terminal emulator, as Eli has suggested.  Lam-alif
would be trickier - one cell or two?

> But if you want to support
> other “arabic like” scripts (like Syriac, N’ko), or even some LTR
> complex scripts, like Myanmar or Khmer, this “solution” cannot work,
> because no equivalent of “presentation form characters” exists for
> these scripts

I believe combining marks present issues even in implicit modes.  In
implicit mode, one cannot simply delegate the task to normal text
rendering, for one has to allocate text to cells.  There are a number
of complications that spring to mind:

1) Some characters decompose to two characters that may otherwise lay
claim to their own cells:

U+06D3 ARABIC LETTER YEH BARREE WITH HAMZA ABOVE decomposes to <06D2,
0654>.  Do you intend that your scheme be usable by Unicode-compliant
processes?

2) 2-part vowels, such as U+0D4A MALAYALAM VOWEL SIGN O, which
canonically decomposes into a preceding combining mark U+0D46 MALAYALAM
VOWEL SIGN E and following combining mark U+0D3E MALAYALAM VOWEL SIGN
AA.

3) Similar 2-part vowels that do not decompose, such as U+17C4 KHMER
VOWEL SIGN OO.  OpenType layout decomposes that into a preceding
'U+17C1 KHMER VOWEL SIGN E' and the second part.

4) Indic conjuncts.
(i) There are some conjuncts, such as Devanagari K.SSA, where a
display as ,  is simply unacceptable.  In some
closely related scripts, this conjunct has the status of a character.

(ii) In some scripts, e.g. Khmer, the virama-equivalent is not an
acceptable alternative to form a consonant stack.  Khmer could
equally well have been encoded with a set of subscript consonants in
the same manner as Tibetan.

(iii) In some scripts, there are marks named as medial consonants
which function in exactly the same way as <'virama', consonant>; it is
silly to render them in entirely different manners.

5) Some non-spacing marks are spacing marks in some contexts.  U+102F
MYANMAR VOWEL SIGN U is probably the best known example.
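
Some of the character properties cited in points 2, 3 and 5 can be
checked with Python's standard unicodedata module. The helper name
nfd_codepoints is made up for illustration, and the exact results
depend on the Unicode version bundled with the interpreter:

```python
import unicodedata


def nfd_codepoints(ch):
    """Return the canonical decomposition (NFD) of ch as hex code points."""
    return [f"{ord(c):04X}" for c in unicodedata.normalize("NFD", ch)]


# Point 2: MALAYALAM VOWEL SIGN O has a canonical two-part decomposition.
print(nfd_codepoints("\u0D4A"))

# Point 3: KHMER VOWEL SIGN OO has none, so only the font splits it.
print(nfd_codepoints("\u17C4"))

# Point 5: MYANMAR VOWEL SIGN U is classified as a non-spacing mark (Mn),
# whatever width it ends up taking in a given context.
print(unicodedata.category("\u102F"))
```

A cell allocator that relies only on these properties cannot tell that
the Khmer vowel needs space on both sides of its consonant, which is
why the list above treats it as a separate complication.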

Richard.

Re: Proposal for BiDi in terminal emulators

2019-01-30 Thread Adam Borowski via Unicode

On Wed, Jan 30, 2019 at 05:56:00PM +0200, Eli Zaretskii via Unicode wrote:
> > - It doesn't do Arabic shaping.
> 
> It doesn't do _any_ shaping.  Complex script shaping is left to the
> terminal, because it's impossible to do shaping in any reasonable way
> without controlling the fonts being used and accessing the font
> information, and this is not possible when you run on a terminal

It's the inverse of the situation with RTL reordering.  The interface
between the program and the terminal is a character cell grid (really, a
sequence of printables and \e-based codes, but that's a technical detail).

The program (emacs in this case) can do arbitrary reordering of characters
on the grid, it also has lots of information the terminal doesn't.  For
example, what are you going to do when there's a line longer than what fits
on the screen?  Emacs will cut and hide part of it; any attempts to reorder
that paragraph by the terminal are outright broken as you don't _have_ the
paragraph.  Same for a popup window on the middle of the screen partially
obscuring some text underneath.  And if you argue "so make emacs print your
new code to disable formatting", so do thousands of other programs that are
less sophisticated than emacs.

On the other hand, all that the program can output is a sequence of Unicode
codepoints.  These don't include shaping information, and are not supposed
to.  The shaping is explicitly meant to be done by the terminal, and it's
the terminal who's equipped with _most_ of the needed data (it might lack
context just outside screen's end or under an overlapped window, but that's
not specific to complex shaping -- same can happen for the other half of a
CJK character).  You know if the font used supports shaping, you can have
access to a graphic view (as opposed to the array of codepoints) -- heck,
it's only you who know the text is rendered on a screen rather than a
Braille device.  And if you miss an opportunity to shape something, the
result is still readable to the user, merely not as good as it could be.

Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀
⣾⠁⢠⠒⠀⣿⡁ Remember, the S in "IoT" stands for Security, while P stands
⢿⡄⠘⠷⠚⠋⠀ for Privacy.
⠈⠳⣄

Re: Proposal for BiDi in terminal emulators

2019-01-30 Thread Eli Zaretskii via Unicode

> Date: Wed, 30 Jan 2019 15:49:34 +0100
> Cc: unicode@unicode.org
> From: Egmont Koblinger via Unicode 
> 
> I outline in the document problems that arise from the terminal
> emulator performing shaping on its contents in "explicit" mode, which
> is to be used by Emacs and others. The terminal emulator is not aware
> of the characters that are chopped off at the edge of the screen,
> required for shaping. The terminal emulator is not aware of which
> characters happen to be placed next to each other, but belong to
> semantically different UI elements, that is, shouldn't be shaped.
> 
> (And as a side note, FriBidi doesn't provide a method for doing
> shaping on _visual_ order. I'm unsure about other libraries, and
> unsure if there's an algorithm for it at all.)
> 
> Honestly, I have no idea how to best address all these problems at
> once. This is where we can think of extensions "explicit mode level 2",
> use control characters that explicitly specify how to shape certain
> glyphs. This is subject to further research.

Personally, I think we should simply assume that complex script
shaping is left to the terminal, and if the terminal cannot do that,
then that's a restriction of working on a text terminal.  There's
nothing you can do here, because correct shaping requires too many
features that applications running on text terminals cannot use, and
OTOH a terminal emulator who wants to perform shaping needs
information from the application (like the directionality of the text
and its language) that there's no way for the application to provide.

Re: Proposal for BiDi in terminal emulators

2019-01-30 Thread Eli Zaretskii via Unicode

> Date: Wed, 30 Jan 2019 15:25:32 +0100
> Cc: unicode@unicode.org
> From: Egmont Koblinger via Unicode 
> 
> > ╒═══════════╤══════╕
> > │ filename1 │  123 │
> > │ FILENAME2 │   17 │
> > └───────────┴──────┘
> >
> > I'm afraid there's no good way to do BiDi without support from individual
> > programs.
> 
> In this particular example, when the output consists of RTL text in
> logical order (the emitter does not reorder the characters to their
> visual order, nor emit any BiDi controls), combined with line drawing
> and such, there is hardly anything we could do purely on the terminal
> emulator's side.

I think the application could use TAB characters to get to the next
cell, then simplistic reordering would also work.

But in general, yes: this is one of the examples why sophisticated
text-editing applications cannot leave this to the terminal.  (Another
example is handling mouse clicks, if the terminal supports that.)

Re: Proposal for BiDi in terminal emulators

2019-01-30 Thread Eli Zaretskii via Unicode

> Date: Wed, 30 Jan 2019 15:07:22 +0100
> Cc: unicode@unicode.org
> From: Egmont Koblinger via Unicode 
> 
> Another possible approach is to leave the terminal doing BiDi, but
> embed all the text fragments in FSI...PDI blocks.

Does anyone know of a terminal emulator which supports isolates?

Re: Proposal for BiDi in terminal emulators

2019-01-30 Thread Eli Zaretskii via Unicode

> From: Egmont Koblinger 
> Date: Wed, 30 Jan 2019 14:36:42 +0100
> Cc: unicode@unicode.org
> 
> - GNU Emacs reshuffles the characters according to the BiDi algorithm,
> expecting that the terminal emulator doesn't do any BiDi.

Yes, users are told to disable bidi reordering of the terminal, if the
terminal supports that.

> - It doesn't do Arabic shaping.

It doesn't do _any_ shaping.  Complex script shaping is left to the
terminal, because it's impossible to do shaping in any reasonable way
without controlling the fonts being used and accessing the font
information, and this is not possible when you run on a terminal --
the user configures the terminal emulator, and the emulator chooses
the fonts it likes/needs according to that configuration.

(Emacs does support complex script shaping on GUI displays.)

> - When it comes to visually wrapping a line because it doesn't fit in
> the current width, Emacs goes its own way which doesn't match what the
> Unicode BiDi algorithm says.

Yes, this deviation is documented in the Emacs manuals.  The reason
for that is that the Emacs implementation of the UBA reorders
characters on the fly (i.e., it implements a function that can be
called character by character, and returns the next character in the
visual order).  This was done due to a special structure of the Emacs
display engine and efficiency considerations.
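
A rough Python sketch of what "a function that returns the next
character in the visual order" can look like. This is NOT the Emacs
implementation and NOT the full UBA; it assumes a left-to-right
paragraph, no explicit BiDi controls, no digits inside RTL runs and no
combining marks. The names visual_order, is_rtl and NEUTRAL are made
up for the illustration.

```python
import unicodedata

# Bidi classes treated as neutral in this simplified model (assumption).
NEUTRAL = {"WS", "ON", "B", "S"}


def is_rtl(ch):
    """True for strong right-to-left characters (Hebrew, Arabic, ...)."""
    return unicodedata.bidirectional(ch) in ("R", "AL")


def visual_order(text):
    """Yield the characters of `text` one at a time, left to right as displayed.

    Simplified model: the paragraph is LTR; a run of RTL letters, including
    neutrals that sit between two RTL letters, is emitted in reverse.
    """
    i, n = 0, len(text)
    while i < n:
        if not is_rtl(text[i]):
            yield text[i]
            i += 1
            continue
        # Find the end of the RTL run that starts at i.
        end = j = i + 1
        while j < n:
            if is_rtl(text[j]):
                j += 1
                end = j          # run extends through this RTL letter
            elif unicodedata.bidirectional(text[j]) in NEUTRAL:
                j += 1           # tentatively skip; kept only if RTL follows
            else:
                break
        # Emit the run right-to-left.
        for k in range(end - 1, i - 1, -1):
            yield text[k]
        i = end


logical = "HELLO \u05e9\u05dc\u05d5\u05dd"   # "HELLO" + shin, lamed, vav, final mem
print("".join(visual_order(logical)))
```

The caller pulls one character at a time and never sees a reordered
copy of the whole line, which is the property described above. A real
implementation has to resolve embedding levels, numbers, mirroring and
combining marks as well.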

In practice, this problem happens very rarely.

Re: Proposal for BiDi in terminal emulators

2019-01-30 Thread Egmont Koblinger via Unicode

> A formatted table is pretty unsuitable for automated processing, and
> obviously meant for human display.

Could you please clarify what exactly that data looks like? Maybe a
tiny hexdump of an example?

Is the RTL piece of text already stored in visual order, that is,
beginning with the leftmost (last logical) letter of the word? If so
then you can surely display it properly in BiDi-unaware rendering
engines (including most terminal emulators currently, as well as in
"explicit" mode according to my specification). That is, whoever
produces that data reverses that word for you?

Or is the RTL piece of text still in its logical order? Then in what
piece of software does this formatted data show up to you in a
readable way?
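
For reference, a small Python sketch that dumps the two candidate
storage orders for one Hebrew word. The word and the helper name
codepoints are only an illustration; plain string reversal stands in
for "visual order" here, which is valid only for a single RTL word
without combining marks.

```python
def codepoints(text):
    """Render a string as a space-separated list of U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


# shin, lamed, vav, final mem: the word as typed (logical order)
logical = "\u05e9\u05dc\u05d5\u05dd"
# The same word stored leftmost glyph first (visual order, read left to right)
visual_ltr = logical[::-1]

print("logical:", codepoints(logical))
print("visual: ", codepoints(visual_ltr))
print("utf-8 bytes of the logical form:", logical.encode("utf-8").hex(" "))
```

If the dump of the file starts with U+05DD (final mem), the producer
has already reversed the word; if it starts with U+05E9 (shin), the
data is in logical order.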

> You're a terminal emulator maintainer, thus it's natural for you to think
> it's the right place to come up with a solution.

No. I've been a maintainer/developer/contributor to all kinds of
software, including (but not limited to) terminal emulators, apps
running inside terminal emulators, or a pretty complex RTL homepage.
I'm doing my best in looking at the entire ecosystem, and coming up
with a good BiDi-aware interface between terminal emulators and
applications.

> I'd argue that it's not --
> all a terminal emulator can do is to display already formatted text, there's
> no sane way to move things around.

You missed that your use case with this table is not the only possible
use case. There are others where the terminal needs to do BiDi. My
work aims to address multiple use cases at once, yours being one of
them.


cheers,
egmont

Re: Proposal for BiDi in terminal emulators

2019-01-30 Thread Adam Borowski via Unicode

On Wed, Jan 30, 2019 at 03:43:10PM +0100, Egmont Koblinger via Unicode wrote:
> On Wed, Jan 30, 2019 at 3:32 PM Adam Borowski  wrote:
> 
> > > > ╒═══════════╤══════╕
> > > > │ filename1 │  123 │
> > > > │ FILENAME2 │   17 │
> > > > └───────────┴──────┘
> 
> > That's possible only if the program in question is running directly attached
> > to the tty.  That's not an option if the output is redirected.  Frames in
> > a plain text file are a perfectly rational, and pretty widespread, use --
> > and your proposal will break them all.  Be it "cat" to the screen, "less" or
> > even "mutt" if the text was sent via a mail.
> 
> I'd argue that if you have such data stored in a file, with logical
> order used in Arabic or Hebrew text, combined with line drawing chars
> as you showed, then your data is broken to begin with – broken in the
> sense that it's suitable for automated processing (*), but not for
> display.

A formatted table is pretty unsuitable for automated processing, and
obviously meant for human display.

> (*) but then line drawing chars are not really a nice choice over CSV,
> JSON, whatever.

That's why you use CSV and JSON for machine-readable, plain text for humans,
and XML for neither.

> The only possible choice is for some display engine to be aware that
> line drawing characters are part of a "higher level protocol", and
> BiDi should be applied only in the lower scope. I don't think the
> terminal emulator is the right place to make such decisions

At this point, required information is lost.  Any transformations such as
RTL reordering need to be done earlier, when you still see the _unformatted_
version of the data.

You're a terminal emulator maintainer, thus it's natural for you to think
it's the right place to come up with a solution.  I'd argue that it's not --
all a terminal emulator can do is to display already formatted text, there's
no sane way to move things around.  Any changes need to be localized -- for
example, you can do ligatures only if you keep total length unchanged.  Ie,
the terminal emulator is the right layer for things like complex script
shaping, but not RTL reordering.

Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀
⣾⠁⢠⠒⠀⣿⡁ Remember, the S in "IoT" stands for Security, while P stands
⢿⡄⠘⠷⠚⠋⠀ for Privacy.
⠈⠳⣄

Re: Proposal for BiDi in terminal emulators

2019-01-30 Thread Egmont Koblinger via Unicode

Hi Frédéric,

> I guess Arabic shaping is doable through presentation form characters,
> because the latter are character inherited from legacy standards using
> them in such solutions. But if you want to support other “arabic like”
> scripts (like Syriac, N’ko), or even some LTR complex scripts, like
> Myanmar or Khmer, this “solution” cannot work, because no equivalent of
> “presentation form characters” exists for these scripts

Unfortunately my knowledge ends here, I'm not familiar with shaping
for Syriac and other similar scripts. I'd really appreciate input from
experts here.

I outline in the document problems that arise from the terminal
emulator performing shaping on its contents in "explicit" mode, which
is to be used by Emacs and others. The terminal emulator is not aware
of the characters that are chopped off at the edge of the screen,
required for shaping. The terminal emulator is not aware of which
characters happen to be placed next to each other, but belong to
semantically different UI elements, that is, shouldn't be shaped.

(And as a side note, FriBidi doesn't provide a method for doing
shaping on _visual_ order. I'm unsure about other libraries, and
unsure if there's an algorithm for it at all.)

Honestly, I have no idea how to best address all these problems at
once. This is where we can think of extensions "explicit mode level 2",
use control characters that explicitly specify how to shape certain
glyphs. This is subject to further research.

cheers,
egmont

Re: Proposal for BiDi in terminal emulators

2019-01-30 Thread Egmont Koblinger via Unicode

On Wed, Jan 30, 2019 at 3:32 PM Adam Borowski  wrote:

> > > ╒═══════════╤══════╕
> > > │ filename1 │  123 │
> > > │ FILENAME2 │   17 │
> > > └───────────┴──────┘

> That's possible only if the program in question is running directly attached
> to the tty.  That's not an option if the output is redirected.  Frames in
> a plain text file are a perfectly rational, and pretty widespread, use --
> and your proposal will break them all.  Be it "cat" to the screen, "less" or
> even "mutt" if the text was sent via a mail.

I'd argue that if you have such data stored in a file, with logical
order used in Arabic or Hebrew text, combined with line drawing chars
as you showed, then your data is broken to begin with – broken in the
sense that it's suitable for automated processing (*), but not for
display. I can't think of any utility that would display it properly,
because that's not what the Unicode BiDi algorithm run over this data
produces.

(*) but then line drawing chars are not really a nice choice over CSV,
JSON, whatever.

The only possible choice is for some display engine to be aware that
line drawing characters are part of a "higher level protocol", and
BiDi should be applied only in the lower scope. I don't think the
terminal emulator is the right place to make such decisions – I don't
think any other generic tool (graphical word processor, browser etc.)
does make such a call either.

cheers,
egmont

Re: Proposal for BiDi in terminal emulators

2019-01-30 Thread Frédéric Grosshans via Unicode


On 30/01/2019 at 14:36, Egmont Koblinger via Unicode wrote:

> - It doesn't do Arabic shaping. In my recommendation I'm arguing that
> in this mode, where shuffling the characters is the task of the text
> editor and not the terminal, so should it be for Arabic shaping using
> presentation form characters.


I guess Arabic shaping is doable through presentation form characters, 
because the latter are characters inherited from legacy standards using
them in such solutions. But if you want to support other “arabic like” 
scripts (like Syriac, N’ko), or even some LTR complex scripts, like 
Myanmar or Khmer, this “solution” cannot work, because no equivalent of 
“presentation form characters” exists for these scripts
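
A minimal Python sketch of the point above, using only the standard
unicodedata module. It collects every character whose compatibility
decomposition is tagged <isolated>, <initial>, <medial> or <final> and
maps to a single base letter, then looks at which scripts the base
letters come from. The function name presentation_forms and the
name-prefix test for the script are assumptions made for the
illustration; the result depends on the Unicode version bundled with
the Python interpreter.

```python
import unicodedata

POSITIONAL = {"isolated", "initial", "medial", "final"}


def presentation_forms():
    """Map (base letter, positional form) -> presentation form character."""
    table = {}
    for cp in range(0x110000):
        decomp = unicodedata.decomposition(chr(cp))
        if not decomp.startswith("<"):
            continue                      # no tagged compatibility mapping
        tag, *bases = decomp.split()
        form = tag.strip("<>")
        if form in POSITIONAL and len(bases) == 1:
            base = chr(int(bases[0], 16))
            table[(base, form)] = chr(cp)
    return table


forms = presentation_forms()

BEH = "\u0628"  # ARABIC LETTER BEH
for form in ("isolated", "final", "initial", "medial"):
    shaped = forms[(BEH, form)]
    print(f"{form:9} U+{ord(shaped):04X} {unicodedata.name(shaped)}")

# First word of each base letter's name, as a crude stand-in for its script.
scripts = {unicodedata.name(base).split()[0] for base, _ in forms}
print(sorted(scripts))
```

An application doing its own shaping can pick one of these characters
per letter for Arabic, but the table has no entries at all for Syriac
or N'Ko letters, which is the limitation described above.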

Re: Proposal for BiDi in terminal emulators

2019-01-30 Thread Adam Borowski via Unicode

On Wed, Jan 30, 2019 at 03:25:32PM +0100, Egmont Koblinger via Unicode wrote:
> One more note, to hopefully clarify:
> > ╒═══════════╤══════╕
> > │ filename1 │  123 │
> > │ FILENAME2 │   17 │
> > └───────────┴──────┘
> >
> > I'm afraid there's no good way to do BiDi without support from individual
> > programs.
> 
> In this particular example, when the output consists of RTL text in
> logical order (the emitter does not reorder the characters to their
> visual order, nor emit any BiDi controls), combined with line drawing
> and such, there is hardly anything we could do purely on the terminal
> emulator's side.
> 
> I did not consider the possibility of certain characters (e.g. line
> drawing ones) being "stop characters", and BiDi to get applied only in
> runs of other characters. Any such magic would be arbitrary, fixing a
> subset of the cases while causing other unforeseen breakages elsewhere.
> E.g. what if someone intentionally uses these characters as
> letter-like ones in a BiDi text, like """here I'm talking about the
> '└' shaped corner"""... or what if poor man's ASCII pipe and other
> symbols are used... it's way too risky to go into any kind of
> heuristics.
> 
> In this particular case the terminal cannot magically fix the output
> for you, you'll need to get the application fixed.

That's possible only if the program in question is running directly attached
to the tty.  That's not an option if the output is redirected.  Frames in
a plain text file are a perfectly rational, and pretty widespread, use --
and your proposal will break them all.  Be it "cat" to the screen, "less" or
even "mutt" if the text was sent via a mail.

I don't really see a possibility to do that on the terminal's side, any RTL
reordering would need to be done by the program in question.


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀
⣾⠁⢠⠒⠀⣿⡁ Remember, the S in "IoT" stands for Security, while P stands
⢿⡄⠘⠷⠚⠋⠀ for Privacy.
⠈⠳⣄

Re: Proposal for BiDi in terminal emulators

2019-01-30 Thread Egmont Koblinger via Unicode

Hi Adam,

One more note, to hopefully clarify:

> ╒═══════════╤══════╕
> │ filename1 │  123 │
> │ FILENAME2 │   17 │
> └───────────┴──────┘
>
> I'm afraid there's no good way to do BiDi without support from individual
> programs.

In this particular example, when the output consists of RTL text in
logical order (the emitter does not reorder the characters to their
visual order, nor emit any BiDi controls), combined with line drawing
and such, there is hardly anything we could do purely on the terminal
emulator's side.

I did not consider the possibility of certain characters (e.g. line
drawing ones) being "stop characters", and BiDi to get applied only in
runs of other characters. Any such magic would be arbitrary, fixing a
subset of the cases while causing other unforeseen breakages elsewhere.
E.g. what if someone intentionally uses these characters as
letter-like ones in a BiDi text, like """here I'm talking about the
'└' shaped corner"""... or what if poor man's ASCII pipe and other
symbols are used... it's way too risky to go into any kind of
heuristics.
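
To make the risk concrete, here is a small Python sketch of such a
"stop character" heuristic. It is purely hypothetical: the names BOX,
reorder_cell and reorder_line are invented for the illustration, and
mirroring a cell that contains an RTL letter is only a crude stand-in
for running the BiDi algorithm on that cell.

```python
import re
import unicodedata

# Treat the Box Drawing block as "stop characters" (the heuristic in question).
BOX = re.compile(r"([\u2500-\u257F])")


def is_rtl(ch):
    return unicodedata.bidirectional(ch) in ("R", "AL")


def reorder_cell(cell):
    """Mirror the text of one cell if it contains an RTL letter; keep padding."""
    core = cell.strip(" ")
    if not any(is_rtl(ch) for ch in core):
        return cell
    lead = cell[: len(cell) - len(cell.lstrip(" "))]
    trail = cell[len(cell.rstrip(" ")):]
    return lead + core[::-1] + trail


def reorder_line(line):
    """Apply reorder_cell to every run between box drawing characters."""
    parts = BOX.split(line)  # the capturing group keeps the separators
    return "".join(p if BOX.fullmatch(p) else reorder_cell(p) for p in parts)


framed = "\u2502 \u05e9\u05dc\u05d5\u05dd \u2502  17 \u2502"   # real box drawing
poor_mans = "| \u05d0\u05d1 | 17 |"                            # ASCII pipes

print(reorder_line(framed))     # cells are handled one by one
print(reorder_line(poor_mans))  # the whole line is one "cell": columns swap
```

With real box drawing characters each cell is treated separately, but
with ASCII pipes the entire line counts as one run, so the columns
change places and "17" comes out as "71". The heuristic fixes one case
and silently breaks the neighbouring one.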

In this particular case the terminal cannot magically fix the output
for you, you'll need to get the application fixed.

cheers,
egmont

Re: Proposal for BiDi in terminal emulators

2019-01-30 Thread Egmont Koblinger via Unicode

Hi Adam,

> Even a line is way too big a piece to be safely reordered by the terminal.
> What you propose will break every full-screen program that uses line-drawing
> characters:

Certain terminal emulators already perform BiDi on their lines. They
already break every full-screen program with line-drawing and such, as
you pointed out. What my proposal adds, amongst plenty of other
things, is a means to automatically disable the terminal's BiDi,
rather than having to go to its settings. This way you can automate
the fix of the apps that aren't explicitly fixed, e.g. via wrapper
scripts, or terminfo entries with special ti/te definitions.
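
A minimal Python sketch of such a wrapper. It assumes that the mode
switch is ECMA-48's BDSM (mode 8), where resetting the mode (CSI 8 l)
selects "explicit" and setting it (CSI 8 h) selects "implicit"; check
the final specification before relying on these sequences. The helper
name explicit_bidi is made up, and the sketch simply assumes that
"implicit" was the state before entry.

```python
import contextlib
import io
import sys

CSI = "\x1b["
BDSM_EXPLICIT = CSI + "8l"  # RM 8: terminal leaves the cells where the app puts them
BDSM_IMPLICIT = CSI + "8h"  # SM 8: terminal performs BiDi itself


@contextlib.contextmanager
def explicit_bidi(out=None):
    """Switch the terminal to explicit mode for the duration of the block."""
    out = sys.stdout if out is None else out
    out.write(BDSM_EXPLICIT)
    out.flush()
    try:
        yield
    finally:
        # Restore implicit mode even if the wrapped code raises.
        out.write(BDSM_IMPLICIT)
        out.flush()


# Demonstration against an in-memory stream instead of a real terminal.
demo = io.StringIO()
with explicit_bidi(demo):
    demo.write("full-screen output goes here")
print(repr(demo.getvalue()))
```

A wrapper script would do the same around launching the unmodified
application; a terminal that does not know the mode is expected to
ignore the sequences.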

> I'm afraid there's no good way to do BiDi without support from individual
> programs.

Depends on the use case.

For complex apps, like text editors, you are right, the terminal
emulator must stay out of the game.

For simple utilities, like "cat" and friends, there's no way you can
implement BiDi support in "cat" itself. Here the terminal needs to do
it.

Your use case with tables is perhaps somewhat in the middle. One
possible approach for the emitting utility is to disable BiDi in the
terminal (switch to "explicit" mode) for the scope of this output.
Another possible approach is to leave the terminal doing BiDi, but
embed all the text fragments in FSI...PDI blocks. (This latter is
subject to a bit of further research, to be exactly specified in a
forthcoming version of the specs.)
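
A short Python sketch of the second approach, under the assumption
that the terminal honours isolates (U+2068 FIRST STRONG ISOLATE and
U+2069 POP DIRECTIONAL ISOLATE). The helpers isolate and table_row are
hypothetical, and padding is computed from the number of code points,
which is only right for single-width characters without combining
marks.

```python
FSI = "\u2068"  # FIRST STRONG ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE


def isolate(fragment):
    """Wrap one text fragment so it is reordered independently of its neighbours."""
    return f"{FSI}{fragment}{PDI}"


def table_row(cells, widths):
    """Build one framed row; only the cell text is isolated, not the padding."""
    parts = []
    for text, width in zip(cells, widths):
        padding = " " * max(0, width - len(text))
        parts.append(" " + isolate(text) + padding + " ")
    return "\u2502" + "\u2502".join(parts) + "\u2502"


row = table_row(["\u05e9\u05dc\u05d5\u05dd", "17"], [9, 4])
print(row)
print([f"U+{ord(ch):04X}" for ch in row if ch in (FSI, PDI)])
```

Each cell then forms its own isolated run, so RTL text in one cell
cannot pull the neighbouring number or the frame characters along with
it.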

What is extremely tough here is realizing that there are multiple
conflicting requirements (including the example you gave), and coming
up with a solution that satisfies the needs of all. This is what my
work aims to do.

cheers,
egmont

Re: Proposal for BiDi in terminal emulators

2019-01-30 Thread Egmont Koblinger via Unicode

Hi Eli,

> My personal experience with bringing BiDi to Emacs led me to a firm
> conclusion that BiDi support by terminal emulators cannot be relied on
> by sophisticated text editing and display applications that are
> BiDi-aware.  The terminal emulator can never be smart enough to do
> what the editing needs require, so the application eventually ends up
> jumping through hoops in order to trick the terminal into doing TRT.
> It is easier to tell users to disable BiDi support of the terminal (if
> it even has one), and do everything in the app.  This is the only way
> of having full control of what is displayed, especially when
> "higher-level protocols" need to be used to tailor the UBA to the need
> of the user, because there's usually no way of asking the terminal to
> apply a behavior which deviates from the UBA.

We are absolutely on the same page here. As long as the use case is
text editing or something similar, it's harmful if the terminal
emulator aims to do any BiDi.

Having to tell users to turn off BiDi in the emulator's settings is in
my firm opinion a user experience no-go. It has to be automatic,
happen under the hood, that is, using escape sequences.

There's another side to the entire BiDi story, though. Simple
utilities like "echo", "cat", "ls", "grep" and so on, the line editing
experience of your shell, and the like. It's absolutely not feasible to
add BiDi support to these utilities. Here the only viable approach is
to have the terminal emulator do it.

Hence, as I confirm ECMA TR/53's realization of 28 years ago, there
have to be two substantially different modes. "Explicit" mode for what
you need for Emacs: the terminal to stay out of the game; and
"implicit" mode where the terminal performs BiDi for the sake of "cat"
and other simple utilities.

I'm also arguing that contrary to TR/53, there's no way to hook up a
mode switch to "cat" and a gazillion of other similar tools. The only
realistically implementable approach is if the "implicit" mode is the
default so that simple utilities provide a proper BiDi experience.
Those very few fullscreen apps that do know what they are doing and do
want the terminal to leave the characters at their designated place
(such as Emacs, Vim etc.) will have to request this "explicit" mode
from the terminal.

cheers,
egmont

Re: Proposal for BiDi in terminal emulators

2019-01-30 Thread Egmont Koblinger via Unicode

Hi Eli,

> > In turn, vim, emacs and friends stand there clueless, not knowing
> > how to do BiDi in terminals.
>
> This is inaccurate: [...]

I have to admit, I was somewhat sloppy in the phrasing of this
announcement. My bad, apologies.

Currently some terminal emulators shuffle the characters around for
display purposes, while most don't. There's absolutely no way an
editor (no matter if Emacs or any other) could produce the desired
look on both kinds. I actually present a proof that an editor cannot
always produce the desired look on ones that shuffle their contents
around. So it's a somewhat reasonable expectation to produce the
desired look on ones that don't shuffle their cells.

In the document, more precisely at [1], I evaluate my findings with GNU
Emacs 25.2. (I've just fixed the page to add "GNU", thanks for
pointing this out!)

Brief summary:

- GNU Emacs reshuffles the characters according to the BiDi algorithm,
expecting that the terminal emulator doesn't do any BiDi.

- According to my recommendation, in order to address BiDi in the
entire ecosystem around terminal emulators, the default behavior will
have to be that terminals shuffle the characters around. Don't worry,
there'll be a mode where this shuffling doesn't occur. Emacs (and all
other BiDi-aware text editors) will have to switch to this mode.

- It doesn't do Arabic shaping. In my recommendation I'm arguing that
in this mode, where shuffling the characters is the task of the text
editor and not the terminal, so should it be for Arabic shaping using
presentation form characters.

- When it comes to visually wrapping a line because it doesn't fit in
the current width, Emacs goes its own way which doesn't match what the
Unicode BiDi algorithm says. I'm not saying Emacs's behavior is bad
per se or unreasonable, and it's out of the scope of my work to try to
get it changed, but I'm making a note that it's different.

[1] https://terminal-wg.pages.freedesktop.org/bidi/prior-work/applications.html

cheers,
egmont
