Re: Bidi paragraph direction in terminal emulators (was: Proposal for BiDi in terminal emulators)
I read your email. You mentioned, for example, how a typical Unix/Linux tool shows its usage (e.g. "anycommand --help"): a leading line, then syntaxes and tabulated lists of options, each followed by translated help on the same line. There are some rules for correct display, including with Bidi:

- Separate paragraphs that need a different default Bidi direction by double newlines (to force a hard break).
- Use a single newline on continuation.
- If technical items are untranslatable, make sure they are at the beginning of lines and indented by some leading spaces, before translated ones.
- Avoid breaking lists.
- Try to separate text in natural languages from technical text as much as possible.
- Be careful about correct usage of leading punctuation (notably for list items).
- Be consistent about indentation.
- Normalize spaces.
- Don't assume that TAB controls have the same width (ban TABs except at the beginning of lines).
- In column output, always separate columns with at least two spaces; don't glue them together as if they were sentences.
- Don't use "soft line breaks" in the middle of short lines (less than 72 base characters).
- Don't use any Bidi control!

With some care, you can perfectly translate Linux/Unix tools into languages needing Bidi and get consistent output, but be careful if your text contains placeholders or untranslated technical terms (make sure to surround them with paired punctuation, or don't translate them at all). And avoid paragraphs that would mix natural language and untranslatable technical terms (such as command names or command-line options).

Make sure to test the output so that it will also work with variable-width fonts (don't assume monospaced fonts are used: they do not exist for various scripts and don't work reliably for Arabic and most Asian scripts, not even for Chinese or Japanese, even though these don't need Bidi support).
But the difficulty is not really in the terminal emulators. It is in the source texts given to translators, who don't know the context in which the text will be used and have no hint about which terms should not be translated. The results can become inconsistent. There are many examples, even in Windows 10, where some of the command-line tools are completely unusable with the translated UI: syntax examples that do not even work because some terms were randomly and inconsistently translated or confused; tools that assumed an LTR-only layout of the output and monospaced fonts with one character per display cell; or tools requiring specific fonts whose monospaced variants do not contain the needed characters. This is notably challenging for Asian scripts needing complex clusters if you made these Latin-based assumptions.

On Wed, 6 Feb 2019 at 22:30, Egmont Koblinger wrote:
> Hi Philippe,
>
> Thanks a lot for your input!
> [...]
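A few of the layout rules listed in this message (untranslated option names first, space-only indentation, at least two spaces between columns, no TABs) can be illustrated with a minimal Python sketch. The function name `format_options`, its parameters, and the Hebrew help strings are hypothetical, chosen only for illustration. The column width is measured in code points, which is only adequate for ASCII option names and is not a real display-width measure.

```python
def format_options(options, indent=2, gap=2):
    """Lay out (option, help_text) pairs as two space-padded columns.

    The untranslated option name comes first on each line. Only spaces are
    used for indentation and padding (no TABs), and the two columns are
    separated by at least two spaces.
    """
    gap = max(gap, 2)
    # Crude width: number of code points. Acceptable for ASCII option
    # names, but NOT a display-width measure for arbitrary scripts.
    width = max((len(name) for name, _ in options), default=0)
    lines = []
    for name, help_text in options:
        padding = " " * (width - len(name) + gap)
        lines.append(" " * indent + name + padding + help_text)
    return lines


# Hypothetical option list with illustrative translated help strings.
usage = format_options([
    ("-h, --help", "הצגת עזרה"),
    ("-v, --verbose", "פלט מפורט"),
])
print("\n".join(usage))
```

Keeping the option names in their own left-hand column means the translated text never shares a run with technical terms, so its direction cannot reorder them.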
Re: Bidi paragraph direction in terminal emulators BiDi in terminal emulators
Hi Richard,

> Not necessarily. One could allow the first strong character in the
> prompt to determine the paragraph directions

How does Emacs know what's a prompt? How can it tell it from the previous and next command's output? Whatever it does to know where the prompt is, can it be made into a standard, cross-terminal feature?

> That's what the Emacs
> terminal (invoked by M-x term; top level definition in term.el) does.

I tried it. I executed my default shell and, inside that, a "cat TUTORIAL.he". All the paragraphs are rendered as LTR ones, left-aligned. That's not the way the file is displayed when opened in Emacs. If you claim Emacs's built-in terminal emulator supports BiDi, I'm kindly asking you to present documentation of its behavior, in a similar spirit to my BiDi proposal.

> Not necessarily. One might use cat to glue together files that had
> split into 1400k chunks, in which case it is not even reasonable to
> expect the end of file to be at a character boundary. (Yes, floppy
> disks still have their uses.)

I did not say anything about changing cat's behavior. I recommended changing the convention for such paragraph-oriented text files so that they end with two newlines.

> But the white space between paragraphs is a separator, not a
> terminator. One doesn't require it at the end when formatting
> paragraphs within the cell of a table.

Does this logic also apply to single newline characters? If not, why not, what's the conceptual difference? If it does, why do text files end in a newline?

e.
Re: Bidi paragraph direction in terminal emulators BiDi in terminal emulators
On Wed, 6 Feb 2019 22:01:59 +0100 Egmont Koblinger via Unicode wrote:

> Hi Eli,
> [...]
> You define paragraphs as emptyline-separated blocks on which you
> perform autodetection of the paragraph direction. This is great!
> As I've mentioned, I'd love to have such a mode in terminals, but it's
> subject to underlying improvements, like knowing when a prompt starts
> and ends, because prompts also have to be paragraph delimiters.

Not necessarily. One could allow the first strong character in the prompt to determine the paragraph directions. That's what the Emacs terminal (invoked by M-x term; top level definition in term.el) does.

> On a nitpicking side note:
>
> It's damn ugly not to terminate a text file with a newline. Newline is
> much better thought of as a "terminator" than a "delimiter". For example,
> if you do a "cat file1 file2", you expect file2 to start on its own
> line.

Not necessarily. One might use cat to glue together files that had split into 1400k chunks, in which case it is not even reasonable to expect the end of file to be at a character boundary. (Yes, floppy disks still have their uses.)

> Shouldn't this apply to paragraphs, too, especially when BiDi is in
> the game? I'd argue that an empty line (double newline) shouldn't be a
> delimiter, it should be a terminator for a paragraph.

But the white space between paragraphs is a separator, not a terminator. One doesn't require it at the end when formatting paragraphs within the cell of a table.

Richard.
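For readers unfamiliar with the "first strong character" heuristic mentioned in this message, here is a minimal Python sketch of it. It is a simplified reading of rules P2/P3 of UAX #9 (it does not skip over isolate sequences) and is not a description of what term.el actually does. The function name `first_strong_direction` and the sample prompts are hypothetical.

```python
import unicodedata


def first_strong_direction(text, default="LTR"):
    """Return "LTR" or "RTL" according to the first strong character.

    Strong characters are those with bidi class L (left-to-right) or
    R / AL (right-to-left). Digits, punctuation and spaces are skipped.
    If the text has no strong character, `default` is returned.
    """
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "LTR"
        if bidi_class in ("R", "AL"):
            return "RTL"
    return default


# Hypothetical prompts: one starting with Latin text, one with Hebrew.
print(first_strong_direction("user@host:~$ "))
print(first_strong_direction("> שלום world"))
```

Note that a prompt made only of neutral characters (such as "$ ") has no strong character at all, so the result then depends entirely on the chosen default.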
Re: Bidi paragraph direction in terminal emulators BiDi in terminal emulators
Hi,

I was loose with my terminology once again, which is not a wise thing when you're trying to clarify misunderstandings :)

> But once you have
> decided on a direction, each _line_ within that data is passed
> separately to the BiDi algorithm to get reshuffled; this is what Emacs
> does, this is what my specification says, and this is the right thing.
> That is, for this step, the definition of "paragraph", as the BiDi
> algorithm uses this term, is a line of the text file.

I keep thinking of the BiDi algorithm as one that takes a single paragraph, because that's how I use it in VTE. But in fact, the BiDi algorithm starts by splitting into paragraphs. I keep forgetting about this outermost "for loop" of the BiDi algo.

And with that proper definition, you can of course pass the entire emptyline-delimited segment into the BiDi algorithm in a single step. In its first phase, the BiDi algorithm will split it at newlines, because for the BiDi algorithm (but not when detecting the paragraph direction in Emacs), newline is the paragraph delimiter. Then it will execute the rest of the algorithm for each paragraph (that is: line) separately. This is exactly the same as splitting manually, and then invoking the BiDi algorithm for each line.

cheers,
egmont
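The two levels described above (direction detected once per emptyline-delimited block, reordering then run once per line) can be sketched in Python as follows. This is only an illustration under stated assumptions: direction detection is the simple first-strong-character heuristic, the reordering step itself is not implemented, and the names `block_direction` and `lines_with_direction` are hypothetical.

```python
import unicodedata


def block_direction(block, default="LTR"):
    """First-strong-character heuristic applied to a whole block of text."""
    for ch in block:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "LTR"
        if bidi_class in ("R", "AL"):
            return "RTL"
    return default


def lines_with_direction(text):
    """Yield (direction, line) for every non-empty line of `text`.

    The direction is detected once per emptyline-delimited block; each
    line is then the unit ("paragraph" in UAX #9 terms) that would be
    handed to the BiDi algorithm with that direction. Empty lines only
    act as block boundaries and are not yielded.
    """
    block = []
    # The extra "" acts as a sentinel so the last block is flushed too.
    for line in text.split("\n") + [""]:
        if line.strip() == "":
            if block:
                direction = block_direction("\n".join(block))
                for member in block:
                    yield direction, member
                block = []
        else:
            block.append(line)


sample = "שלום\nhello\n\nworld\nעולם\n"
for direction, line in lines_with_direction(sample):
    print(direction, repr(line))
```

In the sample, "hello" inherits RTL from its block even though the line itself contains only Latin letters, which is exactly the difference from per-line autodetection.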
Re: Bidi paragraph direction in terminal emulators (was: Proposal for BiDi in terminal emulators)
Hi Philippe,

Thanks a lot for your input!

Another fundamental difficulty with terminal emulators is this: these controls (CR, LF...) are control instructions that move the cursor in some way, and then are forgotten. You cannot do BiDi on the instructions the terminal receives. You can only do BiDi on the result, the contents of the canvas after these instructions are executed. Here these controls are either lost, or you have to give a specification of how exactly they need to be remembered, i.e. converted to being part of the canvas's data.

Let's also mention that trying to get apps into using them is quite hopeless. The best you can do is design BiDi around what you already have, which pretty much means hard vs. soft line endings, and hopefully forthcoming semantic marks around shell prompts. (To overcomplicate the story, a received LF doesn't convert the line ending to hard wrapped in most terminal emulators. In some it does. I don't think there's an exact specification anywhere. Maybe the BiDi spec needs to create one. Lines are hard wrapped by default, turned to soft wrapped when the text gets wrapped at the end of the line, and a few random control functions turn them back to hard ones, but in most terminals, a newline is not such a control function.)

Anyway, please also see my previous email; I hope that clarifies a lot for you, too.

cheers,
egmont

On Tue, Feb 5, 2019 at 5:53 PM Philippe Verdy via Unicode wrote:
>
> I think that before making any decision we must decide what we mean by
> "newlines". There are in fact 3 different functions:
> - (1) soft line breaks (which are used to enforce a maximum display width
> between paragraph margins): these are equivalent to breakable and
> compressible whitespace, and do not change the logical paragraph direction;
> they don't insert any additional vertical gap between lines, so the logical
> line-height is preserved and continues uninterrupted.
> If text justification
> applies, this whitespace will be entirely collapsed into the end margin, and
> any text before it will still be justified to match the end margin (until
> the maximum expansion of other whitespace in the middle is reached, and the
> maximum intercharacter gap is also reached, in which case that line will no
> longer be expanded), but this does not apply to terminal emulators, which
> normally never use text justification, so the text will just be aligned to
> the start margin and whitespace before it on the same line is preserved,
> and collapsed only at the end of the line (just before the soft line break itself).
> - (2) hard line breaks: they break to a new line but continue the paragraph
> within its same logical direction, but they are not compressible whitespace
> (and do not depend on the logical end margin of the paragraph).
> - (3) paragraph breaks: generally they introduce an additional vertical gap
> with top and bottom margins.
>
> The problem in terminals is that they usually cannot distinguish types (1)
> and (2); they are simply encoded by a single CR, or LF, or CR+LF, or NEL.
> Type (1) only exists within the framework of a higher-level protocol
> which gives additional interpretation to these "newlines". The special
> control LS is almost never used but may be used for type (1), i.e. soft
> line breaks, and will fall back to type (2), which is represented by the legacy
> "simple" newlines (single CR, or single LF, or single CR+LF, or single NEL).
> I have seen very little or no use of the LS (line separator) special control.
>
> Type (3) may be encoded with PS (paragraph separator), but in terminals (and
> common protocols like MIME) it is usually encoded using a pair of newlines
> (CR+CR, or LF+LF, or CR+LF+CR+LF, or NL+NL), possibly with additional
> whitespace (and additional presentation characters such as ">" in quotations
> inserted in mail responses) between them (needed for MIME and HTTP), which may
> be collapsed when rendering or interpreting them.
>
> Some terminal protocols can also use other legacy ASCII separators such as
> FS, GS, RS, US for grouping units containing multiple paragraphs, or STX/EOT
> pairs for encapsulating whole text documents in a protocol-specific
> envelope format (and will also use some escaping mechanism for special
> controls found in the middle, such as DLE+control to escape the control, or
> DLE+0 to escape a NUL, or DLE+# to escape a DEL, or DEL+x+NN where N are a
> fixed number of hexadecimal, decimal or octal digits). There's a wide variety
> of escaping mechanisms used by various higher-layer protocols (including
> transport protocols or encoding syntaxes used just below the plain-text
> layer, in a lower layer than the transport protocol layer).
>
> On Mon, 4 Feb 2019 at 21:46, Eli Zaretskii via Unicode wrote:
>>
>> > Date: Mon, 4 Feb 2019 19:45:13 +
>> > From: Richard Wordingham via Unicode
>> >
>> > Yes. If one has a text composed of LTR and RTL
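The three break types in the quoted message can be made concrete with a small Python sketch that labels each break in a plain-text stream. The mapping is an assumption taken from the quoted description, not from any specification: LS (U+2028) is labelled "soft", a lone CR, LF, CR+LF or NEL is labelled "hard", and PS (U+2029) or two or more consecutive newlines (optionally with spaces, TABs or ">" between them) are labelled "paragraph". The name `split_breaks` is hypothetical.

```python
import re

# A single legacy newline: CR+LF, lone CR (not followed by LF), LF, or NEL.
_NL = r"(?:\r\n|\r(?!\n)|\n|\x85)"

# Alternatives are tried in order, so a paragraph break wins over a hard one.
_BREAKS = re.compile(
    r"(?P<paragraph>\u2029|" + _NL + r"(?:[ \t>]*" + _NL + r")+)"
    r"|(?P<soft>\u2028)"
    r"|(?P<hard>" + _NL + r")"
)


def split_breaks(text):
    """Return (segment, break_kind) pairs for `text`.

    break_kind is "soft", "hard" or "paragraph" for the break that follows
    the segment, or None for trailing text with no break after it.
    """
    result = []
    pos = 0
    for match in _BREAKS.finditer(text):
        result.append((text[pos:match.start()], match.lastgroup))
        pos = match.end()
    if pos < len(text):
        result.append((text[pos:], None))
    return result


for segment, kind in split_breaks("a\r\nb\n\nc\u2028d\u2029e"):
    print(repr(segment), kind)
```

This only works on the byte stream before it reaches the terminal; as the reply at the top of this message points out, once the controls have been executed on the canvas this information is no longer available unless the terminal records it.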
Re: Bidi paragraph direction in terminal emulators BiDi in terminal emulators
Hi Eli,

(I'm getting lost where to reply, and how the subject gets mangled and the thread split into different ones.)

I've thought about it a lot, experimented with Emacs's behavior, and I've arrived at the conclusion that we are actually much closer to each other than I had thought. Probably there's a lot of misunderstanding due to the different terminology we used.

I've set my terminal to RTL paragraph direction (via the relevant escape sequence), then did a "cat TUTORIAL.he" (the file taken from 26.1), and compared it to what I see in Emacs 25.2.2 – both the graphical one, and the one running in a terminal with no BiDi support.

Apart from a few minor irrelevant differences, they look the same! Hooray!!!

(The differences are:

- I had to slightly modify TUTORIAL.he to make sure none of the lines starts with a BiDi control (I added a preceding character), because currently VTE doesn't support them: there's no character cell to store this data. This definitely needs to be fixed in the second version of my proposal.

- Emacs running in a terminal shows an underscore wherever there's a BiDi control in the source file – while the graphical one doesn't. This looks like a simple bug to me, right?

- Line 1007, the copyright line of this file, uses visual indentation, and Emacs detects LTR paragraph direction for that line. I think it should rather use BiDi controls to have an overall RTL paragraph direction detected, and within that, BiDi controls to force LTR for the text. The terminal shows it with RTL direction, as I manually set it.

Again, all three of these details are irrelevant to my point, namely that in WIP gnome-terminal it looks the same as in Emacs.)

You define paragraphs as emptyline-separated blocks on which you perform autodetection of the paragraph direction. This is great! As I've mentioned, I'd love to have such a mode in terminals, but it's subject to underlying improvements, like knowing when a prompt starts and ends, because prompts also have to be paragraph delimiters.
You convinced me that it's much more important than I thought, thanks a lot for that! I will try to see if I can push for addressing the prerequisite issues sooner.

Indeed I had to manually set RTL paragraph direction; with manual LTR or with per-line autodetection (as VTE can do now) the result would be much worse.

Here's how the story continues from here. Here is where we misunderstood each other (or at the very least I misunderstood you), although we are talking about the same thing, doing things the same way:

The BiDi algorithm takes a paragraph of text at a time, and somehow reshuffles its letters. UAX#9 section 3 starts by saying that the first main phase is separation into "paragraphs".

What are those "paragraphs" that we're talking about _now_? The thing is, both in Emacs and in my specification, it's a logical line of the text (that is: delimited by single newlines). No, in these steps, when UBA is run, the paragraph is no longer defined as an emptyline-delimited segment; it's defined as a line of the text.

To recap: The _paragraph direction_ is determined in Emacs for emptyline-delimited segments of data, which I honestly find a great thing, and would love to do in terminals too; alas, at this point it's blocked by some really nontrivial technical issues. But once you have decided on a direction, each _line_ within that data is passed separately to the BiDi algorithm to get reshuffled; this is what Emacs does, this is what my specification says, and this is the right thing. That is, for this step, the definition of "paragraph", as the BiDi algorithm uses this term, is a line of the text file.

This is where I thought we had a disagreement, but we don't, we just misunderstood each other.

-

On a nitpicking side note:

It's damn ugly not to terminate a text file with a newline. Newline is much better thought of as a "terminator" than a "delimiter". For example, if you do a "cat file1 file2", you expect file2 to start on its own line.
Shouldn't this apply to paragraphs, too, especially when BiDi is in the game? I'd argue that an empty line (double newline) shouldn't be a delimiter, it should be a terminator for a paragraph.

I think "cat file1 file2" should make sure that the last paragraph of file1 and the first paragraph of file2 are printed as separate paragraphs (potentially with different paragraph direction), shouldn't it? I'd argue that if a text file is formatted like TUTORIAL.he, with empty lines denoting paragraph boundaries, then it should also end in an empty line (that is: two newline characters).

-

Feel free to skip the rest :)

Let's make a thought experiment. Let's assume that for running the BiDi algorithm, we'd still stick to the emptyline-delimited paragraph definition. This is not what you do, this is not what I do, but I misunderstood that this is what you did, and I also thought this was a good idea as a potential extension for the BiDi specs – I no longer think so.

This definition is truly problematic, as I'll
Re: mildly OT from bidi - curious email
On Wed, Feb 06, 2019 at 02:30:24PM +, Julian Bradfield via Unicode wrote:
> So far, so common. The curious thing is that the (entirely
> ASCII) company name was enclosed in a left-to-right direction, thus:
>
> Subject: Your Aaa Ltd receipt [#-]
>
> where and are the bidi control characters.
>
> I don't think I've seen this before - I wonder why it happened?

Maybe Stripe stores merchant names with surrounding bidi control characters, so that they’re always rendered in the appropriate direction, even by systems that don’t implement the bidi algorithm? Since the subject is clearly generated automatically from at least three different sources, I can imagine wanting this sort of weak guarantee that merchant names are always marked with the correct writing direction, even if they’re embedded in a different-language string. The directional characters would only need to be added once.

Best,

Arthur
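As an illustration of the guess above, here is a minimal Python sketch of wrapping a stored name in directional controls before embedding it in a template. The archived message does not show which control characters were actually used, so the choice of LRE/RLE (U+202A/U+202B) closed by PDF (U+202C) is an assumption, and nothing here describes Stripe's real implementation. The function name `wrap_with_direction` is hypothetical.

```python
# Explicit directional embedding controls (an assumed choice, see above).
LRE = "\u202a"  # LEFT-TO-RIGHT EMBEDDING
RLE = "\u202b"  # RIGHT-TO-LEFT EMBEDDING
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING


def wrap_with_direction(name, rtl=False):
    """Enclose `name` in an embedding so it keeps its own direction."""
    opener = RLE if rtl else LRE
    return opener + name + PDF


# The name is wrapped once, when stored, and then reused in any template.
stored_name = wrap_with_direction("Aaa Ltd")
subject = "Your {} receipt".format(stored_name)
print(ascii(subject))
```

The wrapped name is two code points longer than the visible text, which is how such controls end up in otherwise pure-ASCII subject lines.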
mildly OT from bidi - curious email
The current bidi discussion prompts me to post a curiosity I received today.

I ordered something from a (UK) company, and the payment receipt came via Stripe. So far, so common. The curious thing is that the (entirely ASCII) company name was enclosed in a left-to-right direction, thus:

Subject: Your Aaa Ltd receipt [#-]

where and are the bidi control characters.

I don't think I've seen this before - I wonder why it happened?

Also today I got an otherwise ASCII message where every paragraph started with BOM (or ZWNBSP as my font prefers to call it). I see from the web that people used to do this - does anybody know which software packages most commonly do it?

--
The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
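Invisible characters like the ones described in this message are easy to miss on screen. A minimal Python sketch for listing them, assuming it is enough to report every character of general category Cf (format), which covers both the bidi controls and U+FEFF; the function name `find_format_characters` is hypothetical:

```python
import unicodedata


def find_format_characters(text):
    """Return (index, code point, name) for each format (Cf) character."""
    found = []
    for index, ch in enumerate(text):
        if unicodedata.category(ch) == "Cf":
            name = unicodedata.name(ch, "<unnamed>")
            found.append((index, "U+%04X" % ord(ch), name))
    return found


# A paragraph that starts with U+FEFF, as in the second curiosity.
for entry in find_format_characters("\ufeffHello"):
    print(entry)
```

Running it over a raw (decoded but otherwise unprocessed) subject line or message body shows exactly which controls a sender inserted and where.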