Re: Bidi paragraph direction in terminal emulators BiDi in terminal emulators
> Date: Thu, 7 Feb 2019 22:35:23 + > From: Richard Wordingham via Unicode > > > > Do you mean you aim to maintain a regex that matches everyone's > > > prompt in the world, without a significant amount of false positive > > > matches on non-prompt lines? > > > Yes. > > Wow! You'll do well to match a prompt such as '2p ', which I used for > a while. Like I said: for any reasonable prompt that doesn't match, you can report a bug, and have the Emacs maintainers deliberate whether your case is important enough to be supported by default. Failing that, you can set the regexp to a suitable value in a mode hook defined on your init file.
Re: Bidi paragraph direction in terminal emulators BiDi in terminal emulators
On Thu, 07 Feb 2019 22:00:20 +0200 Eli Zaretskii via Unicode wrote: > > From: Egmont Koblinger > > Date: Thu, 7 Feb 2019 19:01:33 +0100 > > On Thu, Feb 7, 2019 at 6:53 PM Eli Zaretskii wrote: > > > No, it needs no interaction. Unless the regexp doesn't work for > > > you, which you should then report as a bug in Emacs. > > Do you mean you aim to maintain a regex that matches everyone's > > prompt in the world, without a significant amount of false positive > > matches on non-prompt lines? > Yes. Wow! You'll do well to match a prompt such as '2p ', which I used for a while. Richard.
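To make the false-positive and false-negative concern above concrete, here is a small sketch of heuristic prompt matching. The pattern is purely illustrative: it is an assumption made for this example (a prompt is taken to end in '$', '#', '%' or '>'), not the value Emacs ships in term-prompt-regexp, and the sample lines are made up.

```python
import re

# Hypothetical prompt heuristic (an assumption for illustration only, not
# Emacs's term-prompt-regexp): any text without '#', '$', '%', '>' followed
# by one of those characters and optional spaces.
PROMPT_RE = re.compile(r"^[^#$%>\n]*[#$%>] *")


def looks_like_prompt(line: str) -> bool:
    """Return True if the line starts with something the heuristic takes for a prompt."""
    return PROMPT_RE.match(line) is not None


samples = [
    "user@host:~$ ls",   # conventional prompt: recognised
    "2p cat file1",      # the '2p ' prompt has no terminator character: missed
    "Total: 50% done",   # ordinary command output: wrongly recognised
]

for sample in samples:
    print(f"{sample!r}: {looks_like_prompt(sample)}")
```

Any pattern of this kind can be tightened or loosened, but tightening it to stop the third line from matching tends to make it miss more real prompts, which is the trade-off being argued about here.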
Re: Bidi paragraph direction in terminal emulators BiDi in terminal emulators
> From: Egmont Koblinger > Date: Thu, 7 Feb 2019 19:01:33 +0100 > Cc: Richard Wordingham , > unicode Unicode Discussion > > On Thu, Feb 7, 2019 at 6:53 PM Eli Zaretskii wrote: > > > No, it needs no interaction. Unless the regexp doesn't work for you, > > which you should then report as a bug in Emacs. > > Do you mean you aim to maintain a regex that matches everyone's prompt > in the world, without a significant amount of false positive matches > on non-prompt lines? Yes.
Re: Bidi paragraph direction in terminal emulators BiDi in terminal emulators
On Thu, Feb 7, 2019 at 6:53 PM Eli Zaretskii wrote: > No, it needs no interaction. Unless the regexp doesn't work for you, > which you should then report as a bug in Emacs. Do you mean you aim to maintain a regex that matches everyone's prompt in the world, without a significant amount of false positive matches on non-prompt lines? (It's getting damn off-topic though.) e.
Re: Bidi paragraph direction in terminal emulators BiDi in terminal emulators
> From: Egmont Koblinger > Date: Thu, 7 Feb 2019 18:20:02 +0100 > Cc: Richard Wordingham , > unicode Unicode Discussion > > > It uses a regular expression, see term-prompt-regexp. > > So, it's not automatic, needs user interaction No, it needs no interaction. Unless the regexp doesn't work for you, which you should then report as a bug in Emacs.
Re: Bidi paragraph direction in terminal emulators BiDi in terminal emulators
Hi, On Thu, Feb 7, 2019 at 3:27 PM Eli Zaretskii wrote: > It uses a regular expression, see term-prompt-regexp. So, it's not automatic, needs user interaction, and for that reason, may not have worked for me. (I have other weird things in my prompt, like 256-color sequences that Emacs didn't recognize, perhaps this made the regexp matching fail. Nevermind.) > > Whatever it does to know where the prompt is, can it be made into a > > standard, cross-terminal feature? > > Not sure. It's a kind of heuristic, which is why the regexp is > customizable on user level, so that users could adapt it to their > needs, should that be necessary. iTerm2 has a "shell integration" where the prompt contains explicit markers so that no heuristics or user configuration is needed from the terminal. We're trying to somewhat standardize it at https://gitlab.freedesktop.org/terminal-wg/specifications/issues/4 and get more terminals support it. Not sure where this attempt will take us, we'll see. > In what version of Emacs is that? In the latest version 26 I have > here, the tutorial displays with most paragraphs in RTL direction. 25.2 here, it might have obviously changed for a newer version, glad to hear it. My distro will upgrade in about 2 months. Since I'm not an Emacs user myself, I hope you don't mind if I don't make extra rounds in upgrading now to verify this. cheers, egmont
Re: Bidi paragraph direction in terminal emulators BiDi in terminal emulators
> Date: Thu, 7 Feb 2019 00:45:55 +0100 > Cc: unicode Unicode Discussion > From: Egmont Koblinger via Unicode > > > Not necessarily. One could allow the first strong character in the > > prompt to determine the paragraph directions > > How does Emacs know what's a prompt? How can it tell it from the > previous and next command's output? It uses a regular expression, see term-prompt-regexp. > Whatever it does to know where the prompt is, can it be made into a > standard, cross-terminal feature? Not sure. It's a kind of heuristic, which is why the regexp is customizable on user level, so that users could adapt it to their needs, should that be necessary. > > That's what the Emacs > > terminal (invoked by M-x term; top level definition in term.el) does. > > I tried it. Executed my default shell, and inside that, a "cat > TUTORIAL.he". All the paragraphs are rendered as LTR ones, > left-aligned. Not the way the file is opened in Emacs. In what version of Emacs is that? In the latest version 26 I have here, the tutorial displays with most paragraphs in RTL direction. > If you claim Emacs's built-in terminal emulator supports BiDi, I'm > kindly asking you to present a documentation of its behavior, in > similar spirit to my BiDi proposal. The Emacs terminal emulator displays text as any other text in any other Emacs buffer, so it supports the same bidi reordering as elsewhere. You could make it emulate other terminals by setting the variable bidi-paragraph-direction to either left-to-right or right-to-left, then all the paragraphs will have the base direction you specify. But the default value of this variable in term buffers is nil, which invokes dynamic determination of base paragraph direction.
Re: Bidi paragraph direction in terminal emulators BiDi in terminal emulators
> Date: Wed, 6 Feb 2019 23:32:43 + > From: Richard Wordingham via Unicode > > > You define paragraphs as emptyline-separated blocks on which you > > perform autodetection of the paragraph direction. This is great! As > > I've mentioned, I'd love to have such a mode in terminals, but it's > > subject to underlying improvements, like knowing when a prompt starts > > and ends, because prompts also have to be paragraph delimiters. > > Not necessarily. One could allow the first strong character in the > prompt to determine the paragraph directions. That's what the Emacs > terminal (invoked by M-x term; top level definition in term.el) does. Emacs's built-in terminal emulator does that only because no one bothered to do something about this behavior. I personally don't consider this the correct behavior (but then I don't use M-x term in Emacs except for testing). Emacs does know where the prompt is, so it could implement the rule that whatever follows the prompt starts a new paragraph.
Re: Bidi paragraph direction in terminal emulators BiDi in terminal emulators
On Thu, 7 Feb 2019 00:45:55 +0100 Egmont Koblinger via Unicode wrote: > Hi Richard, > > > Not necessarily. One could allow the first strong character in the > > prompt to determine the paragraph directions > > How does Emacs know what's a prompt? How can it tell it from the > previous and next command's output? I don't believe the Emacs terminal does either. What's special about the prompt is that it starts a line, so most paragraphs start with a prompt. Not all prompts contain a strong character. To let a file's contents control directionality, instead of issuing the command 'cat file1' one would have to issue a shell command '(echo; cat file1)' or similar to terminate the paragraph containing the prompt. The 'echo' inserts an empty line. > > That's what the Emacs > > terminal (invoked by M-x term; top level definition in term.el) > > does. > > I tried it. Executed my default shell, and inside that, a "cat > TUTORIAL.he". All the paragraphs are rendered as LTR ones, > left-aligned. Not the way the file is opened in Emacs. See above. I don't know what your shell is. > If you claim Emacs's built-in terminal emulator supports BiDi, I'm > kindly asking you to present a documentation of its behavior, in > similar spirit to my BiDi proposal. I've a feeling it has emergent behaviour, and may require a lot of experimentation to elucidate. > Does this logic also apply to single newline characters? If not, why > not, what's the conceptual difference? If it does, why do text files > end in a newline? I don't like the convention that removing the newline from the end of a non-empty line changes it into a binary file. The short answer is that some editors allow a text file not to have a final newline; such files are not handled well in the Unix environment. Some things are just untidy messes. 
Compare C, where a semicolon *terminates* statements, but some are terminated by '}', and a semicolon *separates* the expression within the control part of a for statement, and a comma *separates* the constant definitions in an enum declaration - for a long time, a trailing comma inside the braces was illegal. Richard.
Re: Bidi paragraph direction in terminal emulators BiDi in terminal emulators
Hi Richard, > Not necessarily. One could allow the first strong character in the > prompt to determine the paragraph directions How does Emacs know what's a prompt? How can it tell it from the previous and next command's output? Whatever it does to know where the prompt is, can it be made into a standard, cross-terminal feature? > That's what the Emacs > terminal (invoked by M-x term; top level definition in term.el) does. I tried it. Executed my default shell, and inside that, a "cat TUTORIAL.he". All the paragraphs are rendered as LTR ones, left-aligned. Not the way the file is opened in Emacs. If you claim Emacs's built-in terminal emulator supports BiDi, I'm kindly asking you to present a documentation of its behavior, in similar spirit to my BiDi proposal. > Not necessarily. One might use cat to glue together files that had > split into 1400k chunks, in which case it is not even reasonable to > expect the end of file to be at a character boundary. (Yes, floppy > disks still have their uses.) I did not say anything about changing cat's behavior. I recommended to change the convention for such paragraph-oriented text files to end with two newlines. > But the white space between paragraphs is a separator, not a > terminator. One doesn't require it at the end when formatting > paragraphs within the cell of a table. Does this logic also apply to single newline characters? If not, why not, what's the conceptual difference? If it does, why do text files end in a newline? e.
Re: Bidi paragraph direction in terminal emulators BiDi in terminal emulators
On Wed, 6 Feb 2019 22:01:59 +0100 Egmont Koblinger via Unicode wrote: > Hi Eli, > > (I'm getting lost where to reply, and how the subject gets mangled and > the thread split into different ones.) > > > I've thought about it a lot, experimented with Emacs's behavior, and > I've arrived at the conclusion that we are actually much closer to > each other than I had thought. Probably there's a lot of > misunderstanding due to different terminology we used. > > I've set my terminal to RTL paragraph direction (via the relevant > escape sequence), then did a "cat TUTORIAL.he" (the file taken from > 26.1), and compared to what I see in Emacs 25.2.2 – both the graphical > one, and the one running in a terminal of no BiDi. > > Apart from a few minor irrelevant differences, they look the same! > Hooray!!! > > (The differences are: > > - I had to slightly modify TUTORIAL.he to make sure none of the lines > start with a BiDi control (I added a preceding character) because > currently VTE doesn't support them, there's no character cell to store > this data. This definitely needs to be fixed in the second version of > my proposal. > > - Emacs running in a terminal shows an underscore wherever there's a > BiDi control in the source file – while the graphical one doesn't. > This looks like a simple bug to me, right? > > - Line 1007, the copyright line of this file uses visual indentation, > and Emacs detects LTR paragraph for that line. I think it should > rather use BiDi controls to have an overall RTL paragraph direction > detected, and within that BiDi controls to force LTR for the text. The > terminal shows it with RTL direction, as I manually set it. > > Again, all these three details are irrelevant to my point, namely that > in WIP gnome-terminal it looks the same as in Emacs.) > > > You define paragraphs as emptyline-separated blocks on which you > perform autodetection of the paragraph direction. This is great! 
As > I've mentioned, I'd love to have such a mode in terminals, but it's > subject to underlying improvements, like knowing when a prompt starts > and ends, because prompts also have to be paragraph delimiters. Not necessarily. One could allow the first strong character in the prompt to determine the paragraph directions. That's what the Emacs terminal (invoked by M-x term; top level definition in term.el) does. > On a nitpicking side note: > > It's damn ugly not to terminate a text file with a newline. Newline is > much better thought of a "terminator" than a "delimiter". For example, > if you do a "cat file1 file2", you expect file2 to start on its own > line. Not necessarily. One might use cat to glue together files that had split into 1400k chunks, in which case it is not even reasonable to expect the end of file to be at a character boundary. (Yes, floppy disks still have their uses.) > Shouldn't this apply to paragraphs, too, especially when BiDi is in > the game? I'd argue that an empty line (double newline) shouldn't be a > delimiter, it should be a terminator for a paragraph. But the white space between paragraphs is a separator, not a terminator. One doesn't require it at the end when formatting paragraphs within the cell of a table. Richard.
Re: Bidi paragraph direction in terminal emulators BiDi in terminal emulators
Hi, I was loose with my terminology once again, which is not a wise thing when you're trying to clarify misunderstandings :) > But once you have > decided on a direction, each _line_ within that data is passed > separately to the BiDi algorithm to get reshuffled; this is what Emacs > does, this is what my specification says, and this is the right thing. > That is, for this step, the definition of "paragraph", as the BiDi > algorithm uses this term, is a line of the text file. I keep thinking of the BiDi algorithm as one that takes a single paragraph, because that's how I use it in VTE. But in fact, the BiDi algorithm starts by splitting into paragraphs. I keep forgetting about this outermost "for loop" of the BiDi algo. And with that, proper definition, you can of course pass the entire emptyline-delimited segment into the BiDi algorithm in a single step. In its first phase, the BiDi algorithm will split it at newlines, because for the BiDi algorithm (but not when detecting the paragraph direction in Emacs), newline is the paragraph delimiter. Then it will execute the rest of the algorithm for each paragraph (that is: line) separately. This is exactly the same as splitting manually, and then for each line invoking the BiDi algorithm. cheers, egmont
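The two-stage scheme described in this message can be sketched roughly as follows: the base direction is decided once per emptyline-delimited block, and each line of that block is then handed on as its own paragraph with that direction. This is a simplified illustration under stated assumptions, not Emacs's or VTE's code: the first-strong check ignores the isolate handling of UAX#9 rule P2, the block-splitting regex is a guess at what "empty line" means, and the function names are made up for this example.

```python
import re
import unicodedata


def first_strong_direction(text, default="LTR"):
    """Simplified first-strong rule: return the direction of the first
    character with bidi class L, R or AL (isolates are not skipped here)."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "LTR"
        if bidi_class in ("R", "AL"):
            return "RTL"
    return default


def lines_with_direction(text):
    """Stage 1: pick one direction per emptyline-delimited block.
    Stage 2: return every line of the block paired with that direction,
    ready to be passed to a bidi implementation one line at a time."""
    result = []
    for block in re.split(r"\n[ \t]*\n", text.strip("\n")):
        direction = first_strong_direction(block)
        for line in block.split("\n"):
            result.append((direction, line))
    return result


sample = "abc \u05d0\u05d1\n\u05d2\u05d3 def\n\n\u05d0\u05d1 abc\nxyz\n"
for direction, line in lines_with_direction(sample):
    print(direction, ascii(line))
```

In the sample, the second line of the first block starts with Hebrew letters but still inherits LTR from its block, and the Latin-only line "xyz" inherits RTL. Per-line autodetection would flip both of them.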
Re: Bidi paragraph direction in terminal emulators BiDi in terminal emulators
Hi Eli, (I'm getting lost where to reply, and how the subject gets mangled and the thread split into different ones.) I've thought about it a lot, experimented with Emacs's behavior, and I've arrived at the conclusion that we are actually much closer to each other than I had thought. Probably there's a lot of misunderstanding due to different terminology we used. I've set my terminal to RTL paragraph direction (via the relevant escape sequence), then did a "cat TUTORIAL.he" (the file taken from 26.1), and compared to what I see in Emacs 25.2.2 – both the graphical one, and the one running in a terminal of no BiDi. Apart from a few minor irrelevant differences, they look the same! Hooray!!! (The differences are: - I had to slightly modify TUTORIAL.he to make sure none of the lines start with a BiDi control (I added a preceding character) because currently VTE doesn't support them, there's no character cell to store this data. This definitely needs to be fixed in the second version of my proposal. - Emacs running in a terminal shows an underscore wherever there's a BiDi control in the source file – while the graphical one doesn't. This looks like a simple bug to me, right? - Line 1007, the copyright line of this file uses visual indentation, and Emacs detects LTR paragraph for that line. I think it should rather use BiDi controls to have an overall RTL paragraph direction detected, and within that BiDi controls to force LTR for the text. The terminal shows it with RTL direction, as I manually set it. Again, all these three details are irrelevant to my point, namely that in WIP gnome-terminal it looks the same as in Emacs.) You define paragraphs as emptyline-separated blocks on which you perform autodetection of the paragraph direction. This is great! As I've mentioned, I'd love to have such a mode in terminals, but it's subject to underlying improvements, like knowing when a prompt starts and ends, because prompts also have to be paragraph delimiters. 
You convinced me that it's much more important than I thought, thanks a lot for that! I will try to see if I can push for addressing the prerequisite issues sooner. Indeed I had to manually set RTL paragraph direction; with manual LTR or with per-line autodetection (as VTE can do now) the result would be much worse. Here's how the story continues from here. Here is where we misunderstood each other (or at the very least I misunderstood you), although we are talking about the same thing, doing things the same way: The BiDi algorithm takes a paragraph of text at a time, and somehow reshuffles its letters. UAX#9 section 3 starts by saying that the first main phase is separation into "paragraphs". What are those "paragraphs" that we're talking about _now_? The thing is, both in Emacs as well as in my specification, it's a logical line of the text (that is: delimited by single newlines). No, in these steps, when UBA is run, the paragraph is no longer defined as emptyline-delimited segments, it's defined as lines of the text. To recap: The _paragraph direction_ is determined in Emacs for emptyline-delimited segments of data, which I honestly find a great thing, and would love to do in terminals too, alas at this point it's blocked by some really nontrivial technical issues. But once you have decided on a direction, each _line_ within that data is passed separately to the BiDi algorithm to get reshuffled; this is what Emacs does, this is what my specification says, and this is the right thing. That is, for this step, the definition of "paragraph", as the BiDi algorithm uses this term, is a line of the text file. This is where I thought we had a disagreement, but we don't, we just misunderstood each other. - On a nitpicking side note: It's damn ugly not to terminate a text file with a newline. Newline is much better thought of as a "terminator" than a "delimiter". For example, if you do a "cat file1 file2", you expect file2 to start on its own line. 
Shouldn't this apply to paragraphs, too, especially when BiDi is in the game? I'd argue that an empty line (double newline) shouldn't be a delimiter, it should be a terminator for a paragraph. I think "cat file1 file2" should make sure that the last paragraph of file1 and the first paragraph of file2 are printed as separate paragraphs (potentially with different paragraph direction), shouldn't it? I'd argue that if a text file is formatted like TUTORIAL.he, with empty lines denoting paragraph boundaries, then it should also end in an empty line (that is: two newline characters). - Feel free to skip the rest :) Let's make a thought experiment. Let's assume that for running the BiDi algorithm, we'd still stick to the emptyline-delimited paragraph definition. This is not what you do, this is not what I do, but I misunderstood that this is what you did, and I also thought this was a good idea as a potential extension for the BiDi specs – I no longer think so. This definition is truly problematic, as I'll
Re: Bidi paragraph direction in terminal emulators BiDi in terminal emulators
> Date: Tue, 5 Feb 2019 00:05:47 + > From: Richard Wordingham via Unicode > > > > Actually, UAX#9 defines "paragraph" as the chunk of text delimited > > > by paragraph separator characters. This means characters whose bidi > > > category is B, which includes Newline, the CR-LF pair on Windows, > > > U+0085 NEL, and U+2029 PARAGRAPH SEPARATOR. > > It actually gives two different definitions. Table UAX#9 4 restricts > the type B to *appropriate newline functions; not all newlines are > paragraph separators. For what exactly is "appropriate newline function" one should read the Unicode Standard, section 5.8. My conclusions from that are different from yours; see below. > > Indeed, this was an oversight on my side. So, with this definition, > > every single newline character starts a new paragraph. The result of > > printf "Hello\nWorld\n" > world.txt > > is a text file consisting of two paragraphs, with 5 characters in > > each. Correct? > > No, it depends on when a newline function is 'appropriate'. TUS 5.8 > Rule R2b applies - 'In simple text editors, interpret any NLF the same > as LS'. That's not all of what the Standard says. Just a couple of paragraphs above Rule R2b, there's this text: Note that even if an implementer knows which characters represent NLF on a particular platform, CR, LF, CRLF, and NEL should be treated the same on input and in interpretation. Only on output is it necessary to distinguish between them. So in practice, IMO the above example does constitute 2 paragraphs, regardless of the underlying platform's conventions.
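The reading argued for in this message, that CR, LF, CRLF and NEL are all treated alike on input and each ends a paragraph, can be checked against the printf example with a few lines of code. This is only a sketch of that one reading: the function name is made up, and treating a trailing separator as a terminator rather than as the start of an empty paragraph is an assumption of the example.

```python
import re

# CRLF is listed first so that it counts as a single separator, not two.
# U+0085 is NEL, U+2029 is PARAGRAPH SEPARATOR.
NLF_OR_PS = re.compile(r"\r\n|[\n\r\x85\u2029]")


def split_paragraphs(text):
    """Split text into paragraphs at every NLF or PARAGRAPH SEPARATOR."""
    parts = NLF_OR_PS.split(text)
    # Assumption: a separator at the very end terminates the last
    # paragraph instead of opening a new, empty one.
    if parts and parts[-1] == "":
        parts.pop()
    return parts


# The same content with Unix, Windows and NEL line ends.
for raw in ("Hello\nWorld\n", "Hello\r\nWorld\r\n", "Hello\x85World\x85"):
    paragraphs = split_paragraphs(raw)
    print(paragraphs, [len(p) for p in paragraphs])
```

Under this reading all three inputs yield two paragraphs of five characters each, independent of the platform's newline convention.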
Re: Bidi paragraph direction in terminal emulators BiDi in terminal emulators
> From: Egmont Koblinger > Date: Tue, 5 Feb 2019 00:08:10 +0100 > Cc: unicode@unicode.org > > every single newline character starts a new paragraph. The result of > printf "Hello\nWorld\n" > world.txt > is a text file consisting of two paragraphs, with 5 characters in each. > Correct? Yes. > > Actually, Emacs implements the rule that paragraphs are separated by > > empty lines. This is documented in the Emacs manuals. > > That is, Emacs overrides UAX#9 and comes up with a different > definition? Yes, Emacs uses the "higher-level protocols" clause in HL1, when the paragraph direction is to be determined from the text. (There's also a way for the user or a Lisp program to force a certain base paragraph direction on all paragraphs in a window that displays some text.) > Furthermore, you argue that in terminals I should follow > Emacs's definition rather than Unicode's? IME, what Emacs uses gives much better results, yes. > I believe I understand your concerns with the per-line paragraph > definition, but this interpretation that I've just shown most likely > leads to even more broken behavior. I don't see how the result could be more broken, when the decisions about base paragraph direction are made much more rarely. The places in text where the paragraph direction will be determined under my proposal is a small subset of the places where it will be determined by the default UBA rules. So it will make the same mistakes as the each-line-is-a-new-paragraph method, but there will be much fewer of such mistakes. In addition to this theoretical argument, I have 10 years of using this in Emacs to back me up. The only difference between Emacs and your example is the very first paragraph. > It's a really nontrivial technical problem to let the terminal > emulator know where each prompt, and/or each command's output begins > and ends. 
There's work going on for letting the terminal emulator > recognize the prompts, but even if it's successful, it'll probably > take 5-10 years to reach the majority of the users. And it probably > still wouldn't solve the case of knowing the boundary between the two > outputs if a "cat file1.txt; cat file2.txt" is executed, let alone if > they're concatenated with "cat file1.txt file2.txt". I think you are trying to find a perfect solution, and because it probably doesn't exist, or at least is hard to come by, you conclude that a solution that is imperfect should be rejected. But I'm not saying my proposal is the perfect solution, just that it is better (sometimes, way better) than the default of considering each line a paragraph. > So, what you're arguing for, is that the default behavior should be > something that's: > - currently not implementable in a semantically correct way (to stop > around shell prompts) due to technical limitations, and > - isn't what Unicode says. The first point has to do with the search for a perfect solution. My advice is to settle for something reasonable even if it is not perfect. The second point is incorrect: the UBA explicitly allows the implementation to apply higher-level protocols for paragraph direction, see HL1 in UAX#9. > You have not convinced me that the pros outweigh the cons. There are no cons in my proposal that aren't already present in the default each-line-is-a-new-paragraph rule. So even if the pros don't outweigh the cons, the balance should be better than under the default. > That being said, I'm more than open to see such a behavior as a > future extension, subject of course to the semantic prompt stuff > being available. I think the default should provide reasonably good display, and each-line-is-a-new-paragraph doesn't.
Re: Bidi paragraph direction in terminal emulators BiDi in terminal emulators
Hi Eli, > IME, this is a grave mistake. I hope I explained why; it is now up to > you to decide what to do about that. Let me share one more thought. I have to admit, I'm not an Emacs user, I only have some vague ideas how powerful a tool it is. But in its very core I still believe it's a text editor – is it fair to say this? It could be used for example to conveniently create TUTORIAL.he. I'm not aware of all the kinds of works you can do in Emacs, but I have a feeling that the kind of work you do in a terminal emulator is potentially more diverse. (Let's not nitpick that a terminal can run emacs and emacs has a terminal inside so mathematically speaking it's all the same...) "cat TUTORIAL.he" is indeed one of the commands you can execute in a terminal, and unfortunately, given what terminals currently understand from their contents, I just cannot make it display as you would prefer (and I agree would make a lot of sense). But it's just one use case. There are plenty of line-oriented tools. Think of "head" and "tail". They operate on lines of files, which end up being paragraphs in the terminal according to my definition. According to your definition, they could cut a paragraph in half, they could render differently than as if the entire file was printed. According to my definition, you'll always get the same visual representation, just on the given fragment of the file. Think of "grep", possibly combined with "-r" to process files recursively, and "-C" to print context lines. Not only can it cut paragraphs (of your definition) in half when it displays the matching line (plus context), but also how would you locate in its output when it switches from one match's context to the next match's context within the same file, or to a match in another file? How would you define a paragraph, and how would you define the bigger unit on which the paragraph direction is guessed? I think it's again a use case where my definition of paragraph is less problematic than yours. 
Think of ad-hoc shell scripts that use "echo"/"printf" to inform the user, "read" to read data etc. Or utilities written in C or whatever that don't care about terminals at all, just print output. In these cases there's no one formatting / wrapping at 80 columns performed by the app. A logical segment is typically printed as a single line, which will be wrapped by the terminal if doesn't fit in the current width (and in some terminals rewrapped when the terminal is resized), this matches my definition of paragraph. There's rarely an empty line injected in these cases; if there is, it is most likely to separate some even bigger semantical units. There are just sooo many use cases, it's impossible to perfectly address all of them at once. "cat TUTORIAL.he" is just one of them, not necessarily the most typical, not necessarily the one that should drive the BiDi design. Let's note that the four "BiDi-aware" terminals that I could test all define paragraphs as lines – I mean visual lines on their own canvas. If the terminal is 80 characters wide, and a utility prints a line of 100 characters, it'll obviously wrap into 80+20 characters. And then these terminals treat them as two separate paragraphs, one with 80 characters and one with 20, and run BiDi separately on them. I'm confident that my specification which says that it should be preserved as a 100 character long paragraph and passed to BiDi accordingly is already a significant step forward. cheers, egmont
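The 80+20 case can be illustrated with a toy example. The sketch below compares first-strong direction detection on the visual (wrapped) rows with detection on the whole logical line. It is a simplification under stated assumptions: one character is taken to occupy one cell, the first-strong check ignores isolates, and the sample line (80 Latin letters followed by 20 Hebrew letters) is contrived to make the difference visible.

```python
import unicodedata


def first_strong_direction(text, default="LTR"):
    """Simplified first-strong rule over bidi classes L, R and AL."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "LTR"
        if bidi_class in ("R", "AL"):
            return "RTL"
    return default


def wrap(line, width):
    """Cut a logical line into visual rows of at most `width` characters
    (assumes one cell per character, which real terminals do not guarantee)."""
    return [line[i:i + width] for i in range(0, len(line), width)]


logical_line = "a" * 80 + "\u05d0" * 20   # 100 characters in total
rows = wrap(logical_line, 80)

print([len(row) for row in rows])
print("per visual row:", [first_strong_direction(row) for row in rows])
print("whole logical line:", first_strong_direction(logical_line))
```

Treating each visual row as a paragraph gives the second row an RTL base direction of its own, and the result would change again if the terminal were resized. Keeping the 100-character line as one paragraph avoids both effects.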
Re: Bidi paragraph direction in terminal emulators BiDi in terminal emulators
On Tue, 5 Feb 2019 00:08:10 +0100 Egmont Koblinger via Unicode wrote: > Hi Eli, > > > Actually, UAX#9 defines "paragraph" as the chunk of text delimited > > by paragraph separator characters. This means characters whose bidi > > category is B, which includes Newline, the CR-LF pair on Windows, > > U+0085 NEL, and U+2029 PARAGRAPH SEPARATOR. It actually gives two different definitions. Table UAX#9 4 restricts the type B to *appropriate newline functions; not all newlines are paragraph separators. > Indeed, this was an oversight on my side. So, with this definition, > every single newline character starts a new paragraph. The result of > printf "Hello\nWorld\n" > world.txt > is a text file consisting of two paragraphs, with 5 characters in > each. Correct? No, it depends on when a newline function is 'appropriate'. TUS 5.8 Rule R2b applies - 'In simple text editors, interpret any NLF the same as LS'. > > Actually, Emacs implements the rule that paragraphs are separated by > > empty lines. This is documented in the Emacs manuals. > > That is, Emacs overrides UAX#9 and comes up with a different > definition? Furthermore, you argue that in terminals I should follow > Emacs's definition rather than Unicode's? Or please clarify if I > misunderstood you here. He's deriving 'B' from a protocol. Richard.
Re: Bidi paragraph direction in terminal emulators BiDi in terminal emulators
Hi Eli, > Actually, UAX#9 defines "paragraph" as the chunk of text delimited by > paragraph separator characters. This means characters whose bidi > category is B, which includes Newline, the CR-LF pair on Windows, > U+0085 NEL, and U+2029 PARAGRAPH SEPARATOR. Indeed, this was an oversight on my side. So, with this definition, every single newline character starts a new paragraph. The result of printf "Hello\nWorld\n" > world.txt is a text file consisting of two paragraphs, with 5 characters in each. Correct? > Actually, Emacs implements the rule that paragraphs are separated by > empty lines. This is documented in the Emacs manuals. That is, Emacs overrides UAX#9 and comes up with a different definition? Furthermore, you argue that in terminals I should follow Emacs's definition rather than Unicode's? Or please clarify if I misunderstood you here. > > while Emacs itself is a viewer that treats runs between single > > newlines as paragraphs. That is, Emacs is inconsistent with itself. > > Incorrect. Emacs always treats a run of text between empty lines as a > single paragraph, in TUTORIAL.he and everywhere else. There's nothing > special about TUTORIAL.he, it is just a plain text file with a few > dozen of bidi formatting controls, needed to show the key sequences > with weak and neutral characters in correct visual order. [...] Thanks for the clarification, I believe it's clear to me now. > At least with Emacs, it is not the same. I think considering each > line as a separate paragraph makes writing bidi plain-text documents > that look right almost impossible, if each line ends in a newline [...] > My personal recommendation is to adopt the empty line rule. It's > simple enough and gives good results IME. [...] > I'm surprised that you describe this as such a complex problem. I > think you explained up-thread that terminal emulators should cope with > lines of text arriving piecemeal, which I interpreted as meaning that > text is stored in the emulator's memory. 
> Modern emulators running on
> windowed desktops also provide scroll-back buffers, and react to
> expose events. So I think the text that is currently in the viewport,
> and also some text previously shown, are stored in memory, and can be
> consulted.

The problem is not the memory management. Let's look at the following
session:

---snip---
prompt$ cat file1.txt
This is the first human-perceived
paragraph.

And this is the
second.
prompt$ cat file2.txt
Here this is the third
paragraph.

And this one is the
fourth.
prompt$
---snip---

If you load the files into Emacs, it is perfectly aware of the contents
of the two files. It can define paragraphs however it wants to, and
BiDi the files accordingly.

The terminal emulator doesn't know what's a shell prompt, what's a
command that the user types, what's the output of that command. (You
don't know either from this snippet. Maybe I only cat'ed file1.txt, and
"prompt$ cat file2.txt" is just the sixth line of this eleven-line
file.)

In the terminal emulator's eyes, with Emacs's definition (empty line
delimited), this is one paragraph:

    prompt$ cat file1.txt
    This is the first human-perceived
    paragraph.

and this is another paragraph:

    And this is the
    second.
    prompt$ cat file2.txt
    Here this is the third
    paragraph.

and similarly for the third one.

I believe I understand your concerns with the per-line paragraph
definition, but this interpretation that I've just shown most likely
leads to even more broken behavior.

It's a really nontrivial technical problem to let the terminal emulator
know where each prompt, and/or each command's output begins and ends.
There's work going on for letting the terminal emulator recognize the
prompts, but even if it's successful, it'll probably take 5-10 years to
reach the majority of the users. And it probably still wouldn't solve
the case of knowing the boundary between the two outputs if a
"cat file1.txt; cat file2.txt" is executed, let alone if they're
concatenated with "cat file1.txt file2.txt".

So, what you're arguing for is that the default behavior should be
something that:
- is currently not implementable in a semantically correct way (to stop
  around shell prompts) due to technical limitations, and
- isn't what Unicode says.

You have not convinced me that the pros outweigh the cons.

That being said, I'm more than open to see such a behavior as a future
extension, subject of course to the semantic prompt stuff being
available.

cheers,
egmont
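
The difference between the two paragraph definitions discussed in this message can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names (is_bidi_paragraph_separator, split_per_newline, split_on_empty_lines) are hypothetical, only "\n" is treated as a line terminator, and the session text uses one line per human-perceived paragraph, which may not match the original layout. It is not how Emacs or any terminal emulator implements this.

```python
import unicodedata


def is_bidi_paragraph_separator(ch):
    """True if ch has Bidi_Class B in Python's copy of the Unicode database."""
    return unicodedata.bidirectional(ch) == "B"


def split_per_newline(text):
    """Per-line reading: every newline terminates a paragraph.

    Assumption: only "\n" is handled; CR-LF, NEL and U+2029 are ignored
    to keep the sketch short.
    """
    lines = text.split("\n")
    if lines and lines[-1] == "":
        lines.pop()  # a final newline ends the last paragraph, it does not start one
    return lines


def split_on_empty_lines(text):
    """Empty-line rule: a paragraph is a run of lines between empty lines."""
    paragraphs, current = [], []
    for line in split_per_newline(text):
        if line.strip() == "":
            if current:
                paragraphs.append("\n".join(current))
                current = []
        else:
            current.append(line)
    if current:
        paragraphs.append("\n".join(current))
    return paragraphs


# Illustrative session; the line layout is an assumption.
SESSION = (
    "prompt$ cat file1.txt\n"
    "This is the first human-perceived paragraph.\n"
    "\n"
    "And this is the second.\n"
    "prompt$ cat file2.txt\n"
    "Here this is the third paragraph.\n"
    "\n"
    "And this one is the fourth.\n"
    "prompt$ \n"
)

if __name__ == "__main__":
    for number, paragraph in enumerate(split_on_empty_lines(SESSION), start=1):
        print("--- paragraph", number, "---")
        print(paragraph)
```

With the empty-line rule the second paragraph runs from the end of the first file's output, through the next prompt and command, into the start of the second file's output, which is the mixing of prompt and output described above. The per-newline rule never merges a prompt with output, at the cost of splitting every multi-line paragraph.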
Re: Bidi paragraph direction in terminal emulators BiDi in terminal emulators
> From: Egmont Koblinger
> Date: Mon, 4 Feb 2019 00:36:23 +0100
> Cc: unicode@unicode.org
>
> The Unicode BiDi algorithm states that it operates on paragraphs of
> text, and leaves it up to a higher protocol to define what a paragraph
> exactly is.
>
> What's the definition of "paragraph" in the context of plain text files?
>
> I don't think there's a single well-established practice.

Actually, UAX#9 defines "paragraph" as the chunk of text delimited by
paragraph separator characters. This means characters whose bidi
category is B, which includes Newline, the CR-LF pair on Windows,
U+0085 NEL, and U+2029 PARAGRAPH SEPARATOR.

> In some, e.g. in Emacs's TUTORIAL.he, or markdown files, it's way
> more complicated, probably there isn't a well-defined grammar for
> how exactly bullet list entries and alike should become new
> paragraphs.

Actually, Emacs implements the rule that paragraphs are separated by
empty lines. This is documented in the Emacs manuals. (That's by
default, users and Lisp programs can control that to some extent.)
This rule is global, and applied to any file or buffer, including
TUTORIAL.he.

>     lorem ipsum FED ]> CBA foobar
>
> The visual representation, in a narrower viewport, might wrap for
> example like this:
>
>     lorem ipsum CBA
>     FED ]> foobar

I suggest to leave line wrapping alone for the moment: it is a further
complication. Let's first talk about text whose every line ends in a
hard newline -- this is what you see in most "simple" text-mode
utilities which we are talking about. If/when we solve the problems
there, we can then look at the issues with wrapping.

> Here comes the twist. Let's view this latter file with a viewer that
> uses a _different_ definition for paragraph. Let's view it in Gedit,
> Emacs, or the work-in-progress BiDi-aware VTE by "cat"ing it, where
> every newline begins a new paragraph – that's how these viewers define
> the notion of "paragraph" for the sake of BiDi.
>
> The visual layout in these viewers becomes:
>
>     lorem ipsum CBA
>     <[ FED foobar
>
> which is just not correct. Since here BiDi is run on the two lines
> separately, the initial "<[" is treated as LTR, placed at the wrong
> location in the wrong order, and the glyphs aren't mirrored.

This kind of problem happens all the time, and you cannot avoid it.
Different programs display bidi text differently. I propose not to try
to solve this problem, because IME it cannot be solved in general.

Let's focus on the terminal emulators that should comply with your
guidelines, and let's try to decide what they should do about base
paragraph direction of text emitted by "simple" text utilities. If
they all make decisions by the same rule, they all will show the same
text identically.

> Now, Emacs ships a TUTORIAL.he which, for most of its contents (but
> not everywhere) seems to treat runs between empty lines as paragraphs,

Correct.

> while Emacs itself is a viewer that treats runs between single
> newlines as paragraphs. That is, Emacs is inconsistent with itself.

Incorrect. Emacs always treats a run of text between empty lines as a
single paragraph, in TUTORIAL.he and everywhere else. There's nothing
special about TUTORIAL.he, it is just a plain text file with a few
dozen of bidi formatting controls, needed to show the key sequences
with weak and neutral characters in correct visual order. (Some of
those controls can probably be removed nowadays, since we now have the
BPA of Unicode 6.3 -- the file was written before Unicode 6.3 was
released.) In fact, I wrote that tutorial as an exercise, to prove to
myself that Emacs can be useful for editing non-trivial bidi text.

> In case you think I got something wrong with Emacs: Could you please
> give exact definitions:
> - What are the exact units (so-called "paragraphs" by UAX9) that it
> runs BiDi on when it loads and displays a file?

See above: for the purpose of the Emacs UBA implementation, paragraphs
are separated by empty lines. That is the only rule in Emacs regarding
paragraph determination.

> - What are the exact units (so-called "paragraphs" by UAX9) in
> TUTORIAL.he on which BiDi needs to be run in order to get the desired
> readable version?

The same. There's nothing special about that file.

> What most likely happens is that in order to see a difference, you'd
> need to have more special symbols, or at least a more special
> constellation of them. Probably TUTORIAL.he is just luckily simple
> enough that such a difference isn't hit.

No, TUTORIAL.he is neither "lucky" nor "simple". I deliberately used
there almost every bidi formatting control there is, where appropriate,
to make sure this stuff works as intended in an otherwise plain text
file.

> Another possibility is (and I cannot check because I can't speak
> Hebrew) that somewhere TUTORIAL.he "cheats" with the logical order to
> get the desired visual one.

There's no cheating there, I assure you.

> This definition of paragraph (stuff between a newline and