Re: Encoding italic
On Tue, 5 Feb 2019 16:01:41 + Andrew West via Unicode wrote:

> You would
> have to first convert any text to be italicized to NFD, then apply
> VS14 to each non-combining character. This alone would make a VS
> solution unacceptable in my opinion.

What is so unacceptable about having to do this?

Richard.
Re: Bidi paragraph direction in terminal emulators (was: Proposal for BiDi in terminal emulators)
I think that before making any decision we must decide what we mean by "newlines". There are in fact 3 different functions:

- (1) soft line breaks (which are used to enforce a maximum display width between paragraph margins): these are equivalent to breakable and compressible whitespace, and do not change the logical paragraph direction; they don't insert any additional vertical gap between lines, so the logical line-height is preserved and continues uninterrupted. If text justification applies, this whitespace will be entirely collapsed into the end margin, and any text before it will still be justified to match the end margin (until the maximum expansion of other whitespace in the middle is reached and the maximum intercharacter gap is also reached, in which case that line will no longer be expanded). But this does not apply to terminal emulators, which normally never use text justification, so the text will just be aligned to the start margin, and whitespace before it on the same line is preserved and collapsed only at the end of the line (just before the soft line break itself).
- (2) hard line breaks: they break to a new line but continue the paragraph within its same logical direction, but they are not compressible whitespace (and do not depend on the logical end margin of the paragraph).
- (3) paragraph breaks: generally they introduce an additional vertical gap with top and bottom margins.

The problem in terminals is that they usually cannot distinguish types (1) and (2): both are simply encoded by a single CR, or LF, or CR+LF, or NEL. Type (1) exists only within the framework of a higher-level protocol which gives additional interpretation to these "newlines". The special control LS is almost never used but may be used for type (1), i.e. soft line breaks, and will fall back to type (2), which is represented by the legacy "simple" newlines (single CR, or single LF, or single CR+LF, or single NEL).
I have seen very little or no use of the LS (line separator) special control. Type (3) may be encoded with PS (paragraph separator), but in terminals (and common protocols like MIME) it is usually encoded using a pair of newlines (CR+CR, or LF+LF, or CR+LF+CR+LF, or NL+NL), possibly with additional whitespace (and additional presentation characters such as ">" in quotations inserted in mail responses) between them (needed for MIME and HTTP), which may be collapsed when rendering or interpreting them.

Some terminal protocols can also use other legacy ASCII separators such as FS, GS, RS, US for grouping units containing multiple paragraphs, or STX/EOT pairs for encapsulating whole text documents in a protocol-specific envelope format (and will also use some escaping mechanism for special controls found in the middle, such as DLE+control to escape the control, or DLE+0 to escape a NUL, or DLE+# to escape a DEL, or DEL+x+NN where N are a fixed number of hexadecimal, decimal or octal digits). There's a wide variety of escaping mechanisms used by various higher-layer protocols (including transport protocols or encoding syntaxes used just below the plain-text layer, in a lower layer than the transport protocol layer).

On Mon, 4 Feb 2019 at 21:46, Eli Zaretskii via Unicode wrote:

> > Date: Mon, 4 Feb 2019 19:45:13 +
> > From: Richard Wordingham via Unicode
> >
> > Yes. If one has a text composed of LTR and RTL paragraphs, one has to
> > choose how far apart their starting margins are. I think that could
> > get complicated for plain text if the terminal has unbounded width.
>
> But no real-life terminal does. The width is always bounded.
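The three newline functions distinguished in this message can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `classify_breaks` and the labels `soft`, `hard` and `para` are hypothetical, LS is taken as the only explicit soft break, a lone CR, LF, CR+LF or NEL is taken as a hard break, and PS or two newlines separated only by spaces or tabs are taken as a paragraph break. Quoting characters such as ">" between the two newlines are not handled.

```python
import re

# One newline function: CR+LF, a lone CR, LF, or NEL (U+0085).
# The lookahead keeps a CR+LF pair from being counted as two newlines.
_NL = r"(?:\r\n|\r(?!\n)|\n|\u0085)"

# Hypothetical labels: "para" = type (3), "soft" = type (1), "hard" = type (2).
_BREAK = re.compile(
    rf"(?P<para>\u2029|{_NL}[ \t]*{_NL})"  # PS, or a doubled newline
    r"|(?P<soft>\u2028)"                   # LS
    rf"|(?P<hard>{_NL})"                   # any single legacy newline
)


def classify_breaks(text):
    """Return (offset, kind) for every line or paragraph break in text."""
    return [(m.start(), m.lastgroup) for m in _BREAK.finditer(text)]


if __name__ == "__main__":
    print(classify_breaks("a\nb\n\nc\u2028d\u2029e"))
```

A terminal that receives only legacy newlines will never see the `soft` label, which is the point made above: the distinction between types (1) and (2) has to come from a higher-level protocol.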
Re: Bidi paragraph direction in terminal emulators
> From: Egmont Koblinger
> Date: Tue, 5 Feb 2019 02:28:50 +0100
> Cc: unicode@unicode.org
>
> I have to admit, I'm not an Emacs user, I only have some vague ideas
> how powerful a tool it is. But in its very core I still believe it's a
> text editor – is it fair to say this? It could be used for example to
> conveniently create TUTORIAL.he.

It is a text editing/processing environment which has a lot of text-based applications built on top of it. It could be (and was) used to create TUTORIAL.he, but it can be and is used for much more.

> There are plenty of line-oriented tools.
> [...]

Actually, for every utility you mention, Emacs has a command that either invokes the utility and presents its output, or does the same job by using built-in features. So most/all of the jobs you mention are routinely done in Emacs. After all, Emacs is a programmer's editor at its core, so every job programmers routinely do from the shell prompt has an equivalent feature in Emacs. You can even run shells inside Emacs, with Emacs serving as a terminal emulator (which then supports bidi ;-).

> There are just sooo many use cases, it's impossible to perfectly
> address all of them at once.

I don't think you need to look for a perfect solution. You need to look for one that works reasonably well in practice. It is my experience in Emacs that the empty line as paragraph delimiter produces much better results than if you treat each line as a separate paragraph. We do have in Emacs features that allow overriding the default paragraph direction, but experience shows that they are used relatively rarely.

> I'm confident that my specification which says that it should be
> preserved as a 100 character long paragraph and passed to BiDi
> accordingly is already a significant step forward.

I agree, but I urge you to make one more step, which IME is really necessary.
Re: Encoding italic
On Tue, 5 Feb 2019 at 15:34, wjgo_10...@btinternet.com via Unicode wrote:

> italic version of a glyph in plain text, including a suggestion of to
> which characters it could apply, would test whether such a proposal
> would be accepted to go into the Document Register for the Unicode
> Technical Committee to consider or just be deemed out of scope and
> rejected and not considered by the Unicode Technical Committee.

Just reminding you that "The initial character in a variation sequence is never a nonspacing combining mark (gc=Mn) or a canonical decomposable character" (The Unicode Standard 11.0 §23.4). This means that a variation sequence cannot be defined for any precomposed letters and diacritics, so for example you could not italicize the word "fête" by simply adding VS14 after each letter because "ê" (in NFC form) cannot act as the base for a variation sequence.

You would have to first convert any text to be italicized to NFD, then apply VS14 to each non-combining character. This alone would make a VS solution unacceptable in my opinion.

Andrew
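For concreteness, the two-step procedure described in this message (convert to NFD, then append VS14 to each non-combining character) might look roughly like the sketch below. It is purely illustrative: Unicode defines no italic variation sequences, the function name `italicize_with_vs14` is made up, and the choice to skip whitespace and to treat every character of general category M* as "combining" is an assumption of this sketch.

```python
import unicodedata

VS14 = "\uFE0D"  # U+FE0D VARIATION SELECTOR-14


def italicize_with_vs14(text):
    """Hypothetical: NFD-decompose text, then follow each base character with VS14."""
    decomposed = unicodedata.normalize("NFD", text)
    out = []
    for ch in decomposed:
        out.append(ch)
        is_mark = unicodedata.category(ch).startswith("M")
        if not is_mark and not ch.isspace():
            # Assumption: every non-mark, non-space character takes the selector.
            out.append(VS14)
    return "".join(out)


if __name__ == "__main__":
    result = italicize_with_vs14("f\u00eate")  # "fête" with precomposed ê
    print([f"U+{ord(c):04X}" for c in result])
```

With the precomposed "ê" (U+00EA) as input, the selector ends up between "e" and the combining circumflex U+0302, which is the ordering the message describes.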
Re: Bidi paragraph direction in terminal emulators
> From: Egmont Koblinger
> Date: Tue, 5 Feb 2019 01:32:34 +0100
> Cc: unicode@unicode.org
>
> On the other hand, it's not unreasonable for higher level stuff (e.g.
> shell scripts, or tools like "zip") to use such control characters.

Yes, but most of them won't ever do that.

> > No, this simple case must work reasonably well with the application
> > _completely_ oblivious to the bidi aspects. If this can't work
> > reasonably well, I submit that the entire concept of having a
> > bidi-aware terminal emulator doesn't "hold water".
>
> There isn't a magic wand. I can't magically fix every BiDi stuff by
> changing the terminal emulator's source code.

I didn't say "magically fix", I said "work reasonably well". I think it would be a mistake to demand that any alternative to the default each-line-is-a-new-paragraph method must be perfect. It should be enough if an alternative is better.

> What my specification essentially modifies is that with this
> specification, you at least will have a chance to get the mode right.

My experience is that this is an important feature to have, but it will (maybe even should) be used rather rarely. In most cases you will just have plain text.

Moreover, emitting the control sequences that set the mode is in itself a complication, because if the terminal doesn't support them, the result could be corrupted display. You will need methods of detecting the support, and those detection methods usually involve sending another control sequence to the terminal and waiting for a response, something that complicates applications and causes delays in displaying output.

> In case of "zip", the creators of that software know exactly how the
> output should look like

Not necessarily true. The translations are normally prepared by people who are experts only in translating messages; they don't necessarily consider layout issues, because for that you'd need to look at the code or even run the program, something translators are unlikely to do.

> If you're about to internationalize your software, this layout is a
> pretty bad choice.

Tell me about that! But the reality is that this is what you get, and IMO the solution for displaying this on a terminal should work reasonably well with that.

> This kind of formatting also ignores that English is a pretty dense
> language, in other languages the strings tend to become longer.

Actually, some/many RTL scripts tend to produce shorter text, because vowels are not written, and because many words have very short roots. But this is a tangent.
Re: Bidi paragraph direction in terminal emulators (was: Proposal for BiDi in terminal emulators)
> Date: Tue, 5 Feb 2019 00:05:47 +
> From: Richard Wordingham via Unicode
>
> > > Actually, UAX#9 defines "paragraph" as the chunk of text delimited
> > > by paragraph separator characters. This means characters whose bidi
> > > category is B, which includes Newline, the CR-LF pair on Windows,
> > > U+0085 NEL, and U+2029 PARAGRAPH SEPARATOR.
>
> It actually gives two different definitions. UAX#9 Table 4 restricts
> the type B to *appropriate* newline functions; not all newlines are
> paragraph separators.

For what exactly an "appropriate newline function" is, one should read the Unicode Standard, section 5.8. My conclusions from that are different from yours; see below.

> > Indeed, this was an oversight on my side. So, with this definition,
> > every single newline character starts a new paragraph. The result of
> >   printf "Hello\nWorld\n" > world.txt
> > is a text file consisting of two paragraphs, with 5 characters in
> > each. Correct?
>
> No, it depends on when a newline function is 'appropriate'. TUS 5.8
> Rule R2b applies - 'In simple text editors, interpret any NLF the same
> as LS'.

That's not all of what the Standard says. Just a couple of paragraphs above Rule R2b, there's this text:

  Note that even if an implementer knows which characters represent NLF on a particular platform, CR, LF, CRLF, and NEL should be treated the same on input and in interpretation. Only on output is it necessary to distinguish between them.

So in practice, IMO the above example does constitute 2 paragraphs, regardless of the underlying platform's conventions.
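The reading defended in this message, where every character of bidi class B ends a paragraph and CR+LF counts once, can be checked with a short Python sketch. The function name `split_paragraphs` is hypothetical, and the sketch deliberately ignores the R2b interpretation argued by the other side, under which a newline function would be treated like LS instead.

```python
import unicodedata


def split_paragraphs(text):
    """Split text at characters whose Bidi_Class is B (CR+LF counts as one)."""
    paragraphs = []
    current = []
    i = 0
    while i < len(text):
        ch = text[i]
        if unicodedata.bidirectional(ch) == "B":
            paragraphs.append("".join(current))
            current = []
            # Treat a CR immediately followed by LF as a single separator.
            if ch == "\r" and text[i + 1:i + 2] == "\n":
                i += 1
        else:
            current.append(ch)
        i += 1
    if current:
        paragraphs.append("".join(current))
    return paragraphs


if __name__ == "__main__":
    # Same content as: printf "Hello\nWorld\n" > world.txt
    print(split_paragraphs("Hello\nWorld\n"))
```

On the `printf` example this yields two paragraphs of 5 characters each. U+2028 LINE SEPARATOR has bidi class WS rather than B, so it does not split a paragraph here.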
Re: Bidi paragraph direction in terminal emulators (was: Proposal for BiDi in terminal emulators)
> From: Egmont Koblinger
> Date: Tue, 5 Feb 2019 00:08:10 +0100
> Cc: unicode@unicode.org
>
> every single newline character starts a new paragraph. The result of
>   printf "Hello\nWorld\n" > world.txt
> is a text file consisting of two paragraphs, with 5 characters in each.
> Correct?

Yes.

> > Actually, Emacs implements the rule that paragraphs are separated by
> > empty lines. This is documented in the Emacs manuals.
>
> That is, Emacs overrides UAX#9 and comes up with a different
> definition?

Yes, Emacs uses the "higher-level protocols" clause in HL1, when the paragraph direction is to be determined from the text. (There's also a way for the user or a Lisp program to force a certain base paragraph direction on all paragraphs in a window that displays some text.)

> Furthermore, you argue that in terminals I should follow
> Emacs's definition rather than Unicode's?

IME, what Emacs uses gives much better results, yes.

> I believe I understand your concerns with the per-line paragraph
> definition, but this interpretation that I've just shown most likely
> leads to even more broken behavior.

I don't see how the result could be more broken, when the decisions about base paragraph direction are made much more rarely. The places in text where the paragraph direction will be determined under my proposal are a small subset of the places where it will be determined by the default UBA rules. So it will make the same mistakes as the each-line-is-a-new-paragraph method, but there will be far fewer such mistakes.

In addition to this theoretical argument, I have 10 years of using this in Emacs to back me up. The only difference between Emacs and your example is the very first paragraph.

> It's a really nontrivial technical problem to let the terminal
> emulator know where each prompt, and/or each command's output begins
> and ends. There's work going on for letting the terminal emulator
> recognize the prompts, but even if it's successful, it'll probably
> take 5-10 years to reach the majority of the users. And it probably
> still wouldn't solve the case of knowing the boundary between the two
> outputs if a "cat file1.txt; cat file2.txt" is executed, let alone if
> they're concatenated with "cat file1.txt file2.txt".

I think you are trying to find a perfect solution, and because it probably doesn't exist, or at least is hard to come by, you conclude that a solution that is imperfect should be rejected. But I'm not saying my proposal is the perfect solution, just that it is better (sometimes, way better) than the default of considering each line a paragraph.

> So, what you're arguing for, is that the default behavior should be
> something that's:
> - currently not implementable in a semantically correct way (to stop
> around shell prompts) due to technical limitations, and
> - isn't what Unicode says.

The first point has to do with the search for a perfect solution. My advice is to settle for something reasonable even if it is not perfect.

The second point is incorrect: the UBA explicitly allows the implementation to apply higher-level protocols for paragraph direction, see HL1 in UAX#9.

> You have not convinced me that the pros outweigh the cons.

There are no cons in my proposal that aren't already present in the default each-line-is-a-new-paragraph rule. So even if the pros don't outweigh the cons, the balance should be better than under the default.

> That being said, I'm more than open to see such a behavior as a
> future extension, subject of course to the semantic prompt stuff
> being available.

I think the default should provide reasonably good display, and each-line-is-a-new-paragraph doesn't.
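The argument that the empty-line rule makes fewer base-direction decisions than the per-line rule can be illustrated with a small Python sketch. It is not Emacs's implementation: the helper names are hypothetical, `base_direction` is a simplified first-strong-character test in the spirit of rules P2 and P3 of UAX#9 (it ignores isolates), and the left-to-right fallback for text without strong characters is an assumption.

```python
import re
import unicodedata


def base_direction(paragraph):
    """Return 'ltr' or 'rtl' from the first strong character (simplified P2/P3)."""
    for ch in paragraph:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "ltr"
        if bidi_class in ("R", "AL"):
            return "rtl"
    return "ltr"  # assumption: no strong character means left-to-right


def split_per_line(text):
    """Default rule: every non-empty line is its own paragraph."""
    return [line for line in text.split("\n") if line]


def split_on_empty_lines(text):
    """Empty-line rule: paragraphs are separated by blank lines."""
    return [p for p in re.split(r"\n[ \t]*\n", text) if p.strip()]


SHALOM = "\u05e9\u05dc\u05d5\u05dd"  # a Hebrew word, all class R
sample = f"abc {SHALOM}\n{SHALOM} abc\n\n{SHALOM}\n"

per_line = [base_direction(p) for p in split_per_line(sample)]
per_block = [base_direction(p) for p in split_on_empty_lines(sample)]

if __name__ == "__main__":
    print(per_line)
    print(per_block)
```

For this sample the per-line rule makes three decisions and flips direction between the first two lines, while the empty-line rule makes two decisions and keeps the first two lines in one left-to-right paragraph. The decision points of the second rule are a subset of those of the first.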
Re: Encoding italic
James Kass wrote:

William’s suggestion of floating a proposal for handling italics with VS14 might be an example of the old saying about “putting the cart before the horse”.

Well, a proposal just about using VS14 to indicate a request for an italic version of a glyph in plain text, including a suggestion of to which characters it could apply, would test whether such a proposal would be accepted to go into the Document Register for the Unicode Technical Committee to consider or just be deemed out of scope and rejected and not considered by the Unicode Technical Committee.

If the proposal were allowed to become included in the Document Register of the Unicode Technical Committee then if other people wish to submit comments and other proposals then that would be possible as it would have become established that such a topic is deemed acceptable for placing into the Document Register of the Unicode Technical Committee.

William Overington
Tuesday 5 February 2019
Re: Encoding italic
William Overington wrote,

> Well, a proposal just about using VS14 to indicate a request for an
> italic version of a glyph in plain text, including a suggestion of to
> which characters it could apply, would test whether such a proposal
> would be accepted to go into the Document Register for the Unicode
> Technical Committee to consider or just be deemed out of scope and
> rejected and not considered by the Unicode Technical Committee.

As long as “italics in plain-text” is considered out-of-scope by Unicode, any proposal for handling italics in plain-text would probably be considered out-of-scope, as well. But I could be wrong and wouldn’t mind seeing a proposal.
Re: Ancient Greek apostrophe marking elision
On Tue, Feb 5, 2019 at 12:23 AM James Kass via Unicode wrote:

> Text a man has JOINED together, let not algorithm put asunder.

I was hoping so much that ὃ οὖν ὁ θεὸς συνέζευξεν ἄνθρωπος μὴ χωριζέτω would have an apostrophe but alas no.