Hi Eli, (I'm responding in multiple emails.)

The Unicode BiDi algorithm states that it operates on paragraphs of text, and leaves it up to a higher protocol to define what exactly a paragraph is.

What's the definition of "paragraph" in the context of plain text files? I don't think there's a single well-established practice. In some text files, every explicit newline character starts a new paragraph. In some (e.g. COPYING.GPL and friends), an empty line (that is: two consecutive newline characters) separates two paragraphs. In some, e.g. Emacs's TUTORIAL.he, or markdown files, it's way more complicated; there probably isn't a well-defined grammar for how exactly bullet list entries and the like should become new paragraphs. In the output of "dpkg -s packagename", consecutive lines indented by 1 space (except for those where there's only a single dot after the space) form the human-perceived paragraphs. There are surely several other syntaxes out there.

If the producer of a text file uses a different definition than the viewer software, bugs can arise.

I think this should be intuitively obvious, but just in case, let me give a concrete example. In this example I'll assume LTR paragraph direction set up by some external means; with autodetected paragraph direction it's much easier to come up with such breakages.

I wish to store and deliver the following text, as it's laid out here in logical order. That is, the order in which the bytes appear in the text file, as I typed them from the keyboard, is laid out here strictly from left to right, with uppercase standing for RTL letters, and no mirroring:

    lorem ipsum ABC <[ DEF foobar

The visual representation, what I expect to see in any decent viewer software, is this according to the BiDi algorithm:

    lorem ipsum FED ]> CBA foobar

The visual representation, in a narrower viewport, might wrap for example like this:

    lorem ipsum CBA
    FED ]> foobar

which is still correct, given that logical "ABC <[ DEF" is a single RTL run.
(This assumes a viewer which, unlike Emacs, follows the Unicode BiDi algorithm for wrapping a paragraph into multiple lines.)

Let's assume that I, as the producer of the text file, wish to create a typical README in the spirit of COPYING.GPL and similar text files, with the paragraph definition that two consecutive newline characters (that is: a single empty line) delimit paragraphs, and a single newline is equivalent to a space.

Since I'd prefer to keep a margin of 16 characters in the source file (for demo purposes), I can take the liberty of replacing the space after "ABC" by a single newline. (Maybe my text editor does this automatically.) The file's contents, again in logical order laid out from left to right, top to bottom, become this:

    lorem ipsum ABC
    <[ DEF foobar

This file, according to the paragraph definition chosen earlier, is equivalent to the unwrapped version shown before, and thus should convey the same message.

If I view this file in a piece of software which uses the same paragraph definition for BiDi purposes, the contents will appear as expected. An example of such a viewer is the output of a markdown converter (one that leaves single newlines as-is, and adds a "<p>" at double newlines) viewed as an html file in a browser.

Here comes the twist. Let's view this latter file with a viewer that uses a _different_ definition of paragraph. Let's view it in Gedit, Emacs, or the work-in-progress BiDi-aware VTE by "cat"ing it, where every newline begins a new paragraph. That's how these viewers define the notion of "paragraph" for the sake of BiDi.

The visual layout in these viewers becomes:

    lorem ipsum CBA
    <[ FED foobar

which is just not correct. Since BiDi is run here on the two lines separately, the initial "<[" is treated as LTR, placed at the wrong location in the wrong order, and the glyphs aren't mirrored.
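
In case it helps to play with this, here is a minimal Python sketch that reproduces the three layouts above. It is a toy model, not an implementation of UAX9: it uses the same convention as the examples (uppercase letters are RTL, lowercase are LTR, everything else is neutral), assumes an LTR paragraph direction, resolves a run of neutrals to the direction of its neighbours only if both sides agree, and knows nothing about numbers, explicit embeddings or bracket pairing. The function names (resolve_levels, reorder_line, render) are made up for this illustration.

```python
# Toy BiDi model: uppercase = RTL letter, lowercase = LTR letter, rest = neutral.
# Assumes an LTR paragraph direction; NOT a real UAX9 implementation.

MIRROR = {"<": ">", ">": "<", "[": "]", "]": "[", "(": ")", ")": "("}


def char_class(ch):
    """Classify one character as 'L', 'R' or 'N' (neutral)."""
    if ch.isalpha():
        return "R" if ch.isupper() else "L"
    return "N"


def resolve_levels(paragraph):
    """Return one level per character (0 = LTR, 1 = RTL) for a whole paragraph."""
    classes = [char_class(c) for c in paragraph]
    resolved = list(classes)
    n = len(classes)
    i = 0
    while i < n:
        if classes[i] != "N":
            i += 1
            continue
        j = i
        while j < n and classes[j] == "N":
            j += 1
        # Paragraph boundaries count as LTR, the assumed paragraph direction.
        before = classes[i - 1] if i > 0 else "L"
        after = classes[j] if j < n else "L"
        direction = before if before == after else "L"
        for k in range(i, j):
            resolved[k] = direction
        i = j
    return [1 if c == "R" else 0 for c in resolved]


def reorder_line(line, levels):
    """Reverse (and mirror) each RTL run of a single display line."""
    out = []
    i = 0
    while i < len(line):
        j = i
        while j < len(line) and levels[j] == levels[i]:
            j += 1
        run = line[i:j]
        if levels[i] == 1:
            run = "".join(MIRROR.get(c, c) for c in reversed(run))
        out.append(run)
        i = j
    return "".join(out)


def render(text, paragraph_sep):
    """Resolve levels per paragraph, then reorder each line on its own.

    Paragraphs are whatever text.split(paragraph_sep) yields; a single
    newline inside a paragraph is only a line break (and a neutral).
    """
    visual = []
    for paragraph in text.split(paragraph_sep):
        levels = resolve_levels(paragraph)
        start = 0
        for line in paragraph.split("\n"):
            end = start + len(line)
            visual.append(reorder_line(line, levels[start:end]))
            start = end + 1  # skip the newline itself
    return visual


wrapped = "lorem ipsum ABC\n<[ DEF foobar"

# Unwrapped text, one paragraph:
print(render("lorem ipsum ABC <[ DEF foobar", "\n"))
# ['lorem ipsum FED ]> CBA foobar']

# Producer's definition: an empty line separates paragraphs.
print(render(wrapped, "\n\n"))
# ['lorem ipsum CBA', 'FED ]> foobar']

# Viewer's definition: every newline starts a new paragraph.
print(render(wrapped, "\n"))
# ['lorem ipsum CBA', '<[ FED foobar']
```

The only thing that differs between the last two calls is the separator passed to render(), that is, the definition of "paragraph". The input bytes are identical.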

Now, Emacs ships a TUTORIAL.he which, for most of its contents (but not everywhere), seems to treat runs between empty lines as paragraphs, while Emacs itself is a viewer that treats runs between single newlines as paragraphs. That is, Emacs is inconsistent with itself.

In case you think I got something wrong with Emacs, could you please give exact definitions:

- What are the exact units (so-called "paragraphs" by UAX9) that it runs BiDi on when it loads and displays a file?

- What are the exact units (so-called "paragraphs" by UAX9) in TUTORIAL.he on which BiDi needs to be run in order to get the desired readable version?

What most likely happens is that in order to see a difference, you'd need to have more special symbols, or at least a more special constellation of them. Probably TUTORIAL.he is just luckily simple enough that such a difference isn't hit. Another possibility is (and I cannot check because I can't read Hebrew) that somewhere TUTORIAL.he "cheats" with the logical order to get the desired visual one.

-----

Now, back to terminals.

The smallest viable definition of a "paragraph" in terminal emulators is the stuff between one newline and the next one. It would require a hell of a lot of work, redesigning (overcomplicating) plenty of basics of terminal emulation, to be able to come up with smaller units, e.g. cells of a table (a concept that doesn't currently exist in this world). I don't find any such approach feasible at all.

This definition of paragraph (stuff between a newline and the next one) is the same as the one used by Gedit, Emacs etc. when it comes to displaying BiDi text.

Now, it's possible to ponder other, larger units as possible definitions. For certain files, surely the right approach would be to treat parts delimited by empty lines as paragraphs. But how far should we go? Should terminals understand markdown (one of the most terrible grammars I've ever seen) and all its popular flavors? Should they understand Emacs's TUTORIAL.he?
Should they understand dpkg's format? What else?

There's another conceptual problem here. Most terminal emulators don't understand a single bit of what happens inside them. They don't know where an application's output begins or where it ends. They don't know where the shell prompt is. In fact, they have no idea what a shell prompt is. They only see a single stream of incoming data to process (print printable characters, and obey control instructions). With the paragraph definition of "between a newline and the next one" this is not a problem; everything is doable based on what terminals already know. With any other definition, e.g. if you define paragraphs as "separated by empty lines", I'm sure you'd still need the shell prompt to terminate the previous paragraph and start a new one (the prompt's and command line's), and below the command line, where the next utility's output begins, a new paragraph would need to start as well. But we just don't have this information now.

There are extensions used by some terminal emulators, and perhaps they'll get "standardized" and more widely adopted, to at least let the terminal emulator know where the shell prompt and command line begin and end. But even if they're adopted by many emulators, there's still a problem: are the shells (binaries) going to emit these themselves, or should the user configure the prompt to contain them? It's quite unlikely that we'll have buy-in from all the popular shells. The prompts are maintained by the users themselves, with .bashrc or so defining them; this file is copied over from /etc/skel once and then cannot be updated by distributions. Even if it's going to happen, it'll take many, many years until we can safely rely on this information being generally available.

For the problem of having the same paragraph direction for multiple paragraphs (e.g. an entire file, as cat'ed), we're also hit by this limitation.
Once the knowledge of where a command's output begins and ends becomes available, we'll be able to do this: for example, say that the direction is autodetected on the command's output as one unit, but then BiDi is applied on each line or each emptyline-delimited fragment. We just don't have the necessary information now, and won't have it for a long time.

This is why the only reasonable thing I can imagine is to define paragraphs as newline-delimited segments, and leave it up to future enhancements to introduce other "paragraph" definitions as further options.

cheers,
egmont

