Re: Bidi paragraph direction in terminal emulators (was: Proposal for BiDi in terminal emulators)

2019-02-06 Thread Philippe Verdy via Unicode
I read your email. You mentioned, for example, how a typical Unix/Linux
tool shows its usage (e.g. "anycommand --help"): a leading line, then
syntaxes and tabulated lists of options, each followed by translated help
on the same line.

There are some rules for correct display, including with Bidi:

- Separate paragraphs that need a different default Bidi direction by double
newlines (to force a hard break).
- Use a single newline on continuation.
- If technical items are untranslatable, make sure they are at the beginning
of lines and indented by some leading spaces, before translated ones.
- Avoid breaking lists.
- Try to separate text in natural languages from technical text as much as
possible.
- Be careful about correct usage of leading punctuation (notably for list
items).
- Be consistent about indentation.
- Normalize spaces.
- Don't assume that TAB controls have the same width (ban TABs except at
the beginning of lines).
- In column output, always separate columns with at least two spaces; don't
glue them as if they were sentences.
- Don't use "soft line breaks" in the middle of short lines (less than 72
base characters).
- Don't use any Bidi controls!
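
Some of the rules above can be checked mechanically before a message catalog ships. The following is only a minimal sketch in Python, not an existing tool: the function name lint_help_text and the choice of which rules to check (TABs after the start of a line, explicit Bidi controls) are assumptions made for illustration.

```python
# Explicit Bidi formatting characters: ALM, LRM, RLM, the embeddings and
# overrides (LRE, RLE, PDF, LRO, RLO) and the isolates (LRI, RLI, FSI, PDI).
BIDI_CONTROLS = set(
    "\u061c\u200e\u200f"
    "\u202a\u202b\u202c\u202d\u202e"
    "\u2066\u2067\u2068\u2069"
)


def lint_help_text(text):
    """Return (line_number, message) pairs for two of the rules above.

    Hypothetical helper: it only flags TABs that are not part of the
    leading indentation, and any explicit Bidi control character.
    """
    problems = []
    for number, line in enumerate(text.split("\n"), start=1):
        # Leading spaces and TABs are allowed; anything after them is the body.
        body = line.lstrip(" \t")
        if "\t" in body:
            problems.append((number, "TAB after the beginning of the line"))
        if any(ch in BIDI_CONTROLS for ch in line):
            problems.append((number, "Bidi control character"))
    return problems


sample = "Usage: tool [OPTION]\n  -h, --help  show help\n  -v\tverbose\n\u200fabc"
for number, message in lint_help_text(sample):
    print(f"line {number}: {message}")
```

The remaining rules (list punctuation, column gaps, separation of technical and natural-language text) need knowledge of the message's intent and are left to the translator.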

With some care, you can perfectly translate Linux/Unix tools into languages
needing Bidi and get consistent output, but be careful if your text
contains placeholders or untranslated technical terms (make sure to
surround them with paired punctuation, or don't translate them at all). And
avoid paragraphs that would mix natural language and untranslatable technical
terms (such as command names or command-line options).

Make sure to test the output so that it will also work with variable-width
fonts (don't assume monospaced fonts are used: they do not exist for various
scripts and don't work reliably for Arabic and most Asian scripts, not even
for Chinese or Japanese, even though these don't need Bidi support).

But the difficulty is not really in the terminal emulators but in the
source texts given to translators, when they don't know the context in
which the text will be used and have no hint about which terms should not
be translated. The results can become inconsistent: there are many
examples, even in Windows 10, where some of the command-line tools are
completely unusable with the translated UI, with syntax examples that do
not even work because some terms were randomly and inconsistently
translated or confused. Other tools assume an LTR-only layout of the
output and monospaced fonts with one character per display cell, or
require specific fonts whose monospaced variants do not contain the needed
characters. This is challenging notably for Asian scripts needing
complex clusters if you made these Latin-based assumptions.


On Wed, Feb 6, 2019 at 22:30, Egmont Koblinger wrote:

> Hi Philippe,
>
> Thanks a lot for your input!
>
> Another fundamental difficulty with terminal emulators is: These
> controls (CR, LF...) are control instructions that move the cursor in
> some ways, and then are forgotten. You cannot do BiDi on the
> instructions the terminal receives. You can only do BiDi on the
> result, the contents of the canvas after these instructions are
> executed. Here these controls are either lost, or you have to give a
> specification how exactly they need to be remembered, i.e. converted
> to being part of the canvas's data.
>
> Let's also mention that trying to get apps into using them is quite
> hopeless. The best you can do is design BiDi around what you already
> have, which pretty much means hard vs. soft line endings, and
> hopefully forthcoming semantical marks around shell prompts. (To
> overcomplicate the story, a received LF doesn't convert the line
> ending to hard wrapped in most terminal emulators. In some it does. I
> don't think there's an exact specification anywhere. Maybe the BiDi
> spec needs to create one. Lines are hard wrapped by default, turned to
> soft wrapped when the text gets wrapped at the end of the line, and a
> few random control functions turn them back to hard ones, but in most
> terminals, a newline is not such a control function.)
>
> Anyway, please also see my previous email; I hope that clarifies a lot
> for you, too.
>
>
> cheers,
> egmont
>
> On Tue, Feb 5, 2019 at 5:53 PM Philippe Verdy via Unicode
>  wrote:
> >
> > I think that before making any decision we must make some decision about
> what we mean by "newlines". There are in fact 3 different functions:
> > - (1) soft line breaks (which are used to enforce a maximum display
> width between paragraph margins): these are equivalent to breakable and
> compressible whitespaces, and do not change the logical paragraph
> direction, they don't insert any additional vertical gap between lines, so
> the logical line-height is preserved and continues uninterrupted. If text
> justification applies, this whitespace will be entirely collapsed into the
> end margin, and any text before it will still be justified to match the
> end margin (until the maximum 

Re: Bidi paragraph direction in terminal emulators BiDi in terminal emulators

2019-02-06 Thread Egmont Koblinger via Unicode
Hi Richard,

> Not necessarily.  One could allow the first strong character in the
> prompt to determine the paragraph directions

How does Emacs know what's a prompt? How can it tell it from the
previous and next command's output?

Whatever it does to know where the prompt is, can it be made into a
standard, cross-terminal feature?

> That's what the Emacs
> terminal (invoked by M-x term; top level definition in term.el) does.

I tried it. Executed my default shell, and inside that, a "cat
TUTORIAL.he". All the paragraphs are rendered as LTR ones,
left-aligned. Not the way the file is opened in Emacs.

If you claim Emacs's built-in terminal emulator supports BiDi, I'd
kindly ask you to present documentation of its behavior, in a
similar spirit to my BiDi proposal.

> Not necessarily.  One might use cat to glue together files that had
> split into 1400k chunks, in which case it is not even reasonable to
> expect the end of file to be at a character boundary.  (Yes, floppy
> disks still have their uses.)

I did not say anything about changing cat's behavior. I recommended to
change the convention for such paragraph-oriented text files to end
with two newlines.

> But the white space between paragraphs is a separator, not a
> terminator.  One doesn't require it at the end when formatting
> paragraphs within the cell of a table.

Does this logic also apply to single newline characters? If not, why
not, what's the conceptual difference? If it does, why do text files
end in a newline?


e.


Re: Bidi paragraph direction in terminal emulators BiDi in terminal emulators

2019-02-06 Thread Richard Wordingham via Unicode
On Wed, 6 Feb 2019 22:01:59 +0100
Egmont Koblinger via Unicode  wrote:

> Hi Eli,
> 
> (I'm getting lost where to reply, and how the subject gets mangled and
> the thread split into different ones.)
> 
> 
> I've thought about it a lot, experimented with Emacs's behavior, and
> I've arrived at the conclusion that we are actually much closer to
> each other than I had thought. Probably there's a lot of
> misunderstanding due to different terminology we used.
> 
> I've set my terminal to RTL paragraph direction (via the relevant
> escape sequence), then did a "cat TUTORIAL.he" (the file taken from
> 26.1), and compared to what I see in Emacs 25.2.2 – both the graphical
> one, and the one running in a terminal of no BiDi.
> 
> Apart from a few minor irrelevant differences, they look the same!
> Hooray!!!
> 
> (The differences are:
> 
> - I had to slightly modify TUTORIAL.he to make sure none of the lines
> start with a BiDi control (I added a preceding character) because
> currently VTE doesn't support them, there's no character cell to store
> this data. This definitely needs to be fixed in the second version of
> my proposal.
> 
> - Emacs running in a terminal shows an underscore wherever there's a
> BiDi control in the source file – while the graphical one doesn't.
> This looks like a simple bug to me, right?
> 
> - Line 1007, the copyright line of this file uses visual indentation,
> and Emacs detects LTR paragraph for that line. I think it should
> rather use BiDi controls to have an overall RTL paragraph direction
> detected, and within that BiDi controls to force LTR for the text. The
> terminal shows it with RTL direction, as I manually set it.
> 
> Again, all these three details are irrelevant to my point, namely that
> in WIP gnome-terminal it looks the same as in Emacs.)
> 
> 
> You define paragraphs as emptyline-separated blocks on which you
> perform autodetection of the paragraph direction. This is great! As
> I've mentioned, I'd love to have such a mode in terminals, but it's
> subject to underlying improvements, like knowing when a prompt starts
> and ends, because prompts also have to be paragraph delimiters.

Not necessarily.  One could allow the first strong character in the
prompt to determine the paragraph directions.  That's what the Emacs
terminal (invoked by M-x term; top level definition in term.el) does.
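
For readers who want to see what "first strong character" means in practice, here is a minimal Python sketch. It is a simplified reading of UAX #9 rules P2 and P3 (it does not skip isolate sequences) and is not taken from term.el; the function name first_strong_direction is an assumption for illustration.

```python
import unicodedata


def first_strong_direction(text, default="LTR"):
    """Return "LTR" or "RTL" based on the first strong character in text.

    Strong characters have Bidi class L (left-to-right) or R / AL
    (right-to-left). If there is none, fall back to the given default.
    """
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "LTR"
        if bidi_class in ("R", "AL"):
            return "RTL"
    return default


print(first_strong_direction("$ ls -l"))                    # Latin prompt
print(first_strong_direction("\u05e9\u05dc\u05d5\u05dd$ "))  # Hebrew prompt
```

A prompt made only of digits and punctuation contains no strong character, so the default applies.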

> On a nitpicking side note:
> 
> It's damn ugly not to terminate a text file with a newline. Newline is
> much better thought of as a "terminator" than a "delimiter". For example,
> if you do a "cat file1 file2", you expect file2 to start on its own
> line.

Not necessarily.  One might use cat to glue together files that had
split into 1400k chunks, in which case it is not even reasonable to
expect the end of file to be at a character boundary.  (Yes, floppy
disks still have their uses.)

> Shouldn't this apply to paragraphs, too, especially when BiDi is in
> the game? I'd argue that an empty line (double newline) shouldn't be a
> delimiter, it should be a terminator for a paragraph.

But the white space between paragraphs is a separator, not a
terminator.  One doesn't require it at the end when formatting
paragraphs within the cell of a table. 

Richard.



Re: Bidi paragraph direction in terminal emulators BiDi in terminal emulators

2019-02-06 Thread Egmont Koblinger via Unicode
Hi,

I was loose with my terminology once again, which is not a wise thing
when you're trying to clarify misunderstandings :)

> But once you have
> decided on a direction, each _line_ within that data is passed
> separately to the BiDi algorithm to get reshuffled; this is what Emacs
> does, this is what my specification says, and this is the right thing.
> That is, for this step, the definition of "paragraph", as the BiDi
> algorithm uses this term, is a line of the text file.

I keep thinking of the BiDi algorithm as one that takes a single
paragraph, because that's how I use it in VTE. But in fact, the BiDi
algorithm starts by splitting into paragraphs. I keep forgetting about
this outermost "for loop" of the BiDi algo.

And with that proper definition, you can of course pass the entire
emptyline-delimited segment into the BiDi algorithm in a single step.
In its first phase, the BiDi algorithm will split it at newlines,
because for the BiDi algorithm (but not when detecting the paragraph
direction in Emacs), newline is the paragraph delimiter. Then it will
execute the rest of the algorithm for each paragraph (that is: line)
separately.

This is exactly the same as splitting manually, and then for each line
invoking the BiDi algorithm.
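
The equivalence can be illustrated with a small Python sketch of that outermost loop. It only performs the splitting step (characters of Bidi class B end a paragraph and stay with it); it does not reorder anything, and it does not treat CR+LF as a single separator. The function name split_bidi_paragraphs is an assumption.

```python
import unicodedata


def split_bidi_paragraphs(text):
    """Split text after each character of Bidi class B (paragraph separator).

    The separator is kept with the paragraph it ends, so joining the
    pieces gives back the original text.
    """
    paragraphs, current = [], []
    for ch in text:
        current.append(ch)
        if unicodedata.bidirectional(ch) == "B":
            paragraphs.append("".join(current))
            current = []
    if current:
        paragraphs.append("".join(current))
    return paragraphs


segment = "first line\nsecond line\nthird line"
pieces = split_bidi_paragraphs(segment)
# For text that only uses LF, this matches a manual split into lines.
print(pieces == segment.splitlines(keepends=True))
```

Each piece would then go through the rest of the algorithm on its own, which is why handing over the whole segment or handing over the lines one by one gives the same result.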


cheers,
egmont


Re: Bidi paragraph direction in terminal emulators (was: Proposal for BiDi in terminal emulators)

2019-02-06 Thread Egmont Koblinger via Unicode
Hi Philippe,

Thanks a lot for your input!

Another fundamental difficulty with terminal emulators is: These
controls (CR, LF...) are control instructions that move the cursor in
some ways, and then are forgotten. You cannot do BiDi on the
instructions the terminal receives. You can only do BiDi on the
result, the contents of the canvas after these instructions are
executed. Here these controls are either lost, or you have to give a
specification how exactly they need to be remembered, i.e. converted
to being part of the canvas's data.

Let's also mention that trying to get apps into using them is quite
hopeless. The best you can do is design BiDi around what you already
have, which pretty much means hard vs. soft line endings, and
hopefully forthcoming semantical marks around shell prompts. (To
overcomplicate the story, a received LF doesn't convert the line
ending to hard wrapped in most terminal emulators. In some it does. I
don't think there's an exact specification anywhere. Maybe the BiDi
spec needs to create one. Lines are hard wrapped by default, turned to
soft wrapped when the text gets wrapped at the end of the line, and a
few random control functions turn them back to hard ones, but in most
terminals, a newline is not such a control function.)
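
To make the hard vs. soft distinction concrete, here is a toy Python model. It is a sketch under strong assumptions and not how VTE or any real terminal is implemented: the class name ToyCanvas is made up, "\n" is treated as "go to the start of the next row", and no other control functions exist.

```python
class ToyCanvas:
    """Toy terminal canvas that remembers which rows were soft wrapped."""

    def __init__(self, width):
        self.width = width
        self.rows = [""]
        # soft_wrapped[i] is True if row i continues onto row i + 1.
        self.soft_wrapped = [False]

    def _new_row(self):
        # Rows start out hard wrapped, as described above.
        self.rows.append("")
        self.soft_wrapped.append(False)

    def feed(self, data):
        for ch in data:
            if ch == "\n":
                # The newline only moves the cursor and is then forgotten;
                # it does not change the wrap flag of the row it leaves.
                self._new_row()
                continue
            if len(self.rows[-1]) == self.width:
                # Text overflows the row: mark it soft wrapped and continue.
                self.soft_wrapped[-1] = True
                self._new_row()
            self.rows[-1] += ch

    def logical_lines(self):
        """Join soft-wrapped rows back into the lines BiDi would run on."""
        lines, current = [], ""
        for row, soft in zip(self.rows, self.soft_wrapped):
            current += row
            if not soft:
                lines.append(current)
                current = ""
        return lines


canvas = ToyCanvas(width=5)
canvas.feed("abcdefgh\nxy")
print(canvas.rows)             # what is stored on the canvas
print(canvas.logical_lines())  # what can be reconstructed afterwards
```

The point of the model is that the canvas stores rows plus one flag per row, not the received byte stream, so any BiDi specification has to be phrased in terms of that stored data.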

Anyway, please also see my previous email; I hope that clarifies a lot
for you, too.


cheers,
egmont

On Tue, Feb 5, 2019 at 5:53 PM Philippe Verdy via Unicode
 wrote:
>
> I think that before making any decision we must make some decision about what 
> we mean by "newlines". There are in fact 3 different functions:
> - (1) soft line breaks (which are used to enforce a maximum display width 
> between paragraph margins): these are equivalent to breakable and 
> compressible whitespaces, and do not change the logical paragraph direction, 
> they don't insert any additional vertical gap between lines, so the logical
> line-height is preserved and continues uninterrupted. If text justification
> applies, this whitespace will be entirely collapsed into the end margin, and
> any text before it will still be justified to match the end margin (until
> the maximum expansion of other whitespaces in the middle is reached, and the
> maximum intercharacter gap is also reached, in which case that line will no
> longer be expanded). But this does not apply to terminal emulators, which
> normally never use text justification, so the text will just be aligned to
> the start margin and whitespaces before it on the same line are preserved,
> and collapsed only at the end of the line (just before the soft line break itself).
> - (2) hard line breaks: they break to a new line but continue the paragraph
> within its same logical direction, but they are not compressible whitespaces
> (and do not depend on the logical end margin of the paragraph).
> - (3) paragraph breaks: generally they introduce an additional vertical gap
> with top and bottom margins.
>
> The problem in terminals is that they usually cannot distinguish types (1)
> and (2): they are simply encoded by a single CR, or LF, or CR+LF, or NEL.
> Type (1) only exists within the framework of a higher-level protocol
> which gives additional interpretation to these "newlines". The special
> control LS is almost never used but may be used for type (1), i.e. soft
> line breaks, and will fall back to type (2), which is represented by the legacy
> "simple" newlines (single CR, or single LF, or single CR+LF, or single NEL).
> I have seen very little or no use of the LS (line separator) special control.
>
> Type (3) may be encoded with PS (paragraph separator), but in terminals (and
> common protocols like MIME) it is usually encoded using a pair of newlines
> (CR+CR, or LF+LF, or CR+LF+CR+LF, or NL+NL), possibly with additional
> whitespaces (and additional presentation characters such as ">" in quotations
> inserted in mail responses) between them (needed for MIME and HTTP), which may
> be collapsed when rendering or interpreting them.
>
> Some terminal protocols can also use other legacy ASCII separators such as
> FS, GS, RS, US for grouping units containing multiple paragraphs, or STX/EOT
> pairs for encapsulating whole text documents in a protocol-specific
> envelope format (and will also use some escaping mechanism for special
> controls found in the middle, such as DLE+control to escape the control, or
> DLE+0 to escape a NUL, or DLE+# to escape a DEL, or DEL+x+NN where N are a
> fixed number of hexadecimal, decimal or octal digits). There's a wide variety
> of escaping mechanisms used by various higher-layer protocols (including
> transport protocols or encoding syntaxes used just below the plain-text
> layer, in a lower layer than the transport protocol layer).
>
> On Mon, Feb 4, 2019 at 21:46, Eli Zaretskii via Unicode
> wrote:
>>
>> > Date: Mon, 4 Feb 2019 19:45:13 +
>> > From: Richard Wordingham via Unicode 
>> >
>> > Yes.  If one has a text composed of LTR and RTL 

Re: Bidi paragraph direction in terminal emulators BiDi in terminal emulators

2019-02-06 Thread Egmont Koblinger via Unicode
Hi Eli,

(I'm getting lost where to reply, and how the subject gets mangled and
the thread split into different ones.)


I've thought about it a lot, experimented with Emacs's behavior, and
I've arrived at the conclusion that we are actually much closer to
each other than I had thought. Probably there's a lot of
misunderstanding due to different terminology we used.

I've set my terminal to RTL paragraph direction (via the relevant
escape sequence), then did a "cat TUTORIAL.he" (the file taken from
26.1), and compared to what I see in Emacs 25.2.2 – both the graphical
one, and the one running in a terminal of no BiDi.

Apart from a few minor irrelevant differences, they look the same! Hooray!!!

(The differences are:

- I had to slightly modify TUTORIAL.he to make sure none of the lines
start with a BiDi control (I added a preceding character) because
currently VTE doesn't support them, there's no character cell to store
this data. This definitely needs to be fixed in the second version of
my proposal.

- Emacs running in a terminal shows an underscore wherever there's a
BiDi control in the source file – while the graphical one doesn't.
This looks like a simple bug to me, right?

- Line 1007, the copyright line of this file uses visual indentation,
and Emacs detects LTR paragraph for that line. I think it should
rather use BiDi controls to have an overall RTL paragraph direction
detected, and within that BiDi controls to force LTR for the text. The
terminal shows it with RTL direction, as I manually set it.

Again, all these three details are irrelevant to my point, namely that
in WIP gnome-terminal it looks the same as in Emacs.)


You define paragraphs as emptyline-separated blocks on which you
perform autodetection of the paragraph direction. This is great! As
I've mentioned, I'd love to have such a mode in terminals, but it's
subject to underlying improvements, like knowing when a prompt starts
and ends, because prompts also have to be paragraph delimiters. You
convinced me that it's much more important than I thought, thanks a
lot for that! I will try to see if I can push for addressing the
prerequisite issues sooner. Indeed I had to manually set RTL paragraph
direction; with manual LTR or with per-line autodetection (as VTE can
do now) the result would be much worse.


Here's how the story continues from here. Here is where we
misunderstood each other (or at the very least I misunderstood you),
although we are talking about the same thing and doing things the same way:

The BiDi algorithm takes a paragraph of text at a time, and somehow
reshuffles its letters. UAX#9 section 3 starts by saying that the
first main phase is separation into "paragraphs". What are those
"paragraphs" that we're talking about _now_?

The thing is, both in Emacs as well as in my specification, it's a
logical line of the text (that is: delimited by single newlines). No,
in these steps, when UBA is run, the paragraph is no longer defined as
emptyline-delimited segments, it's defined as lines of the text.

To recap: The _paragraph direction_ is determined in Emacs for
emptyline-delimited segments of data, which I honestly find a great
thing, and would love to do in terminals too, alas at this point it's
blocked by some really nontrivial technical issues. But once you have
decided on a direction, each _line_ within that data is passed
separately to the BiDi algorithm to get reshuffled; this is what Emacs
does, this is what my specification says, and this is the right thing.
That is, for this step, the definition of "paragraph", as the BiDi
algorithm uses this term, is a line of the text file. This is where I
thought we had a disagreement, but we don't, we just misunderstood
each other.
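
The two levels in this recap can be sketched in Python as follows. This is an illustration only, neither Emacs's code nor the proposal's reference implementation: the helper names are assumptions, the direction detection is a simplified first-strong check, and the actual per-line reordering by a UBA implementation is left out.

```python
import unicodedata


def first_strong(text):
    """Return "LTR" or "RTL" for the first strong character, else None."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "LTR"
        if bidi_class in ("R", "AL"):
            return "RTL"
    return None


def lines_with_direction(text, default="LTR"):
    """Return (direction, line) pairs for all non-empty lines of text.

    The direction is detected once per emptyline-delimited segment;
    each line would then be handed to the BiDi algorithm separately,
    with that direction as its paragraph direction.
    """
    result, segment = [], []
    # The extra "" acts as a sentinel so the last segment is flushed too.
    for line in text.split("\n") + [""]:
        if line.strip():
            segment.append(line)
            continue
        if segment:
            direction = first_strong("\n".join(segment)) or default
            result.extend((direction, seg_line) for seg_line in segment)
            segment = []
    return result


text = "abc \u05d0\n\u05d1 def\n\n\u05d0 abc\nxyz"
for direction, line in lines_with_direction(text):
    print(direction, repr(line))
```

In the sample, the second line starts with a Hebrew letter and the fourth with a Latin one, yet both inherit the direction of their segment; per-line autodetection would have chosen differently.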

-

On a nitpicking side note:

It's damn ugly not to terminate a text file with a newline. Newline is
much better thought of as a "terminator" than a "delimiter". For example,
if you do a "cat file1 file2", you expect file2 to start on its own
line.

Shouldn't this apply to paragraphs, too, especially when BiDi is in
the game? I'd argue that an empty line (double newline) shouldn't be a
delimiter, it should be a terminator for a paragraph. I think "cat
file1 file2" should make sure that the last paragraph of file1 and the
first paragraph of file2 are printed as separate paragraphs
(potentially with different paragraph direction), shouldn't it? I'd
argue that if a text file is formatted like TUTORIAL.he, with empty
lines denoting paragraph boundaries, then it should also end in an
empty line (that is: two newline characters).
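
A tiny Python sketch of the "cat file1 file2" case, with made-up file contents and a deliberately naive paragraph splitter (the function name paragraphs is an assumption):

```python
def paragraphs(text):
    """Split text into emptyline-delimited paragraphs, dropping empty ones."""
    return [p for p in text.split("\n\n") if p.strip()]


# file1 ending in a single newline vs. in an empty line (two newlines).
file1_single = "first paragraph\n"
file1_double = "first paragraph\n\n"
file2 = "second paragraph\n\n"

# Plain concatenation, which is what "cat file1 file2" outputs.
print(len(paragraphs(file1_single + file2)))  # paragraphs get merged
print(len(paragraphs(file1_double + file2)))  # paragraphs stay separate
```

With the single trailing newline the two paragraphs merge into one, so they would also share one detected paragraph direction.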

-

Feel free to skip the rest :)

Let's make a thought experiment. Let's assume that for running the
BiDi algorithm, we'd still stick to the emptyline-delimited paragraph
definition. This is not what you do, this is not what I do, but I
misunderstood that this is what you did, and I also thought this was a
good idea as a potential extension for the BiDi specs – I no longer
think so. This definition is truly problematic, as I'll 

Re: mildly OT from bidi - curious email

2019-02-06 Thread Arthur Reutenauer via Unicode
On Wed, Feb 06, 2019 at 02:30:24PM +, Julian Bradfield via Unicode wrote:
> So far, so common. The curious thing is that the (entirely
> ASCII) company name was enclosed in a left-to-right direction, thus:
> 
> Subject: Your Aaa Ltd receipt [#-]
> 
> where  and  are the bidi control characters.
> 
> I don't think I've seen this before - I wonder why it happened?

  Maybe Stripe stores merchant names with surrounding bidi control
characters, so that they’re always rendered in the appropriate
direction, even by systems that don’t implement the bidi algorithm?
Since the subject is clearly generated automatically from at least three
different sources, I can imagine wanting this sort of weak guarantee
that merchant names are always marked with the correct writing
direction, even if they’re embedded in a different-language string.  The
directional characters would only need to be added once.
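
As a sketch of what I mean (pure speculation about how such a subject could be assembled, not Stripe's actual code): the control characters did not survive in the quoted subject, so the use of LRE and PDF below is an assumption, as are the function names and the receipt number. Note that UAX #9 now recommends isolates (LRI U+2066, PDI U+2069) over embeddings for this kind of wrapping.

```python
LRE = "\u202a"  # LEFT-TO-RIGHT EMBEDDING
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING


def wrap_ltr(name):
    """Wrap a name so that it is laid out left-to-right as one unit."""
    return f"{LRE}{name}{PDF}"


def build_subject(merchant, receipt_id):
    """Assemble a subject line from independently sourced pieces."""
    # The merchant name is wrapped once, wherever it ends up being embedded.
    return f"Your {wrap_ltr(merchant)} receipt [#{receipt_id}]"


subject = build_subject("Aaa Ltd", "1234-5678")  # hypothetical receipt number
print(subject.encode("unicode_escape").decode("ascii"))
```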

Best,

Arthur


mildly OT from bidi - curious email

2019-02-06 Thread Julian Bradfield via Unicode
The current bidi discussion prompts me to post a curiosity I received
today.

I ordered something from a (UK) company, and the payment receipt came
via Stripe. So far, so common. The curious thing is that the (entirely
ASCII) company name was enclosed in a left-to-right direction, thus:

Subject: Your Aaa Ltd receipt [#-]

where  and  are the bidi control characters.

I don't think I've seen this before - I wonder why it happened?

Also today I got an otherwise ASCII message where every paragraph
started with BOM (or ZWNBSP as my font prefers to call it). I see from
the web that people used to do this - anybody know what the most
common software packages that do it are?


