Re: Bidi paragraph direction in terminal emulators BiDi in terminal emulators

2019-02-07 Thread Eli Zaretskii via Unicode
> Date: Thu, 7 Feb 2019 22:35:23 +
> From: Richard Wordingham via Unicode 
> 
> > > Do you mean you aim to maintain a regex that matches everyone's
> > > prompt in the world, without a significant amount of false positive
> > > matches on non-prompt lines?  
> 
> > Yes.
> 
> Wow!  You'll do well to match a prompt such as '2p ', which I used for
> a while.

Like I said: for any reasonable prompt that doesn't match, you can
report a bug, and have the Emacs maintainers deliberate whether your
case is important enough to be supported by default.  Failing that,
you can set the regexp to a suitable value in a mode hook defined on
your init file.
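
To make the trade-off concrete, here is a minimal Python sketch of prompt detection by regular expression. Both patterns are hypothetical illustrations and are not claimed to be the actual value of term-prompt-regexp; the point is only that a default can miss an unusual prompt such as '2p ', can produce a false positive on ordinary output, and can be overridden by the user.

```python
import re

# Hypothetical default, loosely modelled on common shell prompts: any text
# without prompt terminators, then one of # $ % > followed by a space.
# This is NOT claimed to be the actual value of term-prompt-regexp.
DEFAULT_PROMPT = re.compile(r"^[^#$%>\n]*[#$%>] ")

# Hypothetical user override that additionally accepts prompts like "2p ".
CUSTOM_PROMPT = re.compile(r"^(?:[^#$%>\n]*[#$%>] |\d+p )")


def is_prompt(line, pattern=DEFAULT_PROMPT):
    """Return True if the line starts with something that looks like a prompt."""
    return pattern.match(line) is not None


if __name__ == "__main__":
    for line in ["user@host:~$ ls", "2p ls", "Total > 5 items"]:
        print(repr(line), is_prompt(line), is_prompt(line, CUSTOM_PROMPT))
```

The last sample line is ordinary output that the hypothetical default still accepts, which is the kind of false positive the question was about.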


Re: Bidi paragraph direction in terminal emulators BiDi in terminal emulators

2019-02-07 Thread Richard Wordingham via Unicode
On Thu, 07 Feb 2019 22:00:20 +0200
Eli Zaretskii via Unicode  wrote:

> > From: Egmont Koblinger 
> > Date: Thu, 7 Feb 2019 19:01:33 +0100

> > On Thu, Feb 7, 2019 at 6:53 PM Eli Zaretskii  wrote:

> > > No, it needs no interaction.  Unless the regexp doesn't work for
> > > you, which you should then report as a bug in Emacs.  

> > Do you mean you aim to maintain a regex that matches everyone's
> > prompt in the world, without a significant amount of false positive
> > matches on non-prompt lines?  

> Yes.

Wow!  You'll do well to match a prompt such as '2p ', which I used for
a while.

Richard.


Re: Bidi paragraph direction in terminal emulators BiDi in terminal emulators

2019-02-07 Thread Eli Zaretskii via Unicode
> From: Egmont Koblinger 
> Date: Thu, 7 Feb 2019 19:01:33 +0100
> Cc: Richard Wordingham , 
>   unicode Unicode Discussion 
> 
> On Thu, Feb 7, 2019 at 6:53 PM Eli Zaretskii  wrote:
> 
> > No, it needs no interaction.  Unless the regexp doesn't work for you,
> > which you should then report as a bug in Emacs.
> 
> Do you mean you aim to maintain a regex that matches everyone's prompt
> in the world, without a significant amount of false positive matches
> on non-prompt lines?

Yes.


Re: Bidi paragraph direction in terminal emulators BiDi in terminal emulators

2019-02-07 Thread Egmont Koblinger via Unicode
On Thu, Feb 7, 2019 at 6:53 PM Eli Zaretskii  wrote:

> No, it needs no interaction.  Unless the regexp doesn't work for you,
> which you should then report as a bug in Emacs.

Do you mean you aim to maintain a regex that matches everyone's prompt
in the world, without a significant amount of false positive matches
on non-prompt lines?

(It's getting damn off-topic though.)


e.


Re: Bidi paragraph direction in terminal emulators BiDi in terminal emulators

2019-02-07 Thread Eli Zaretskii via Unicode
> From: Egmont Koblinger 
> Date: Thu, 7 Feb 2019 18:20:02 +0100
> Cc: Richard Wordingham , 
>   unicode Unicode Discussion 
> 
> > It uses a regular expression, see term-prompt-regexp.
> 
> So, it's not automatic, needs user interaction

No, it needs no interaction.  Unless the regexp doesn't work for you,
which you should then report as a bug in Emacs.


Re: Bidi paragraph direction in terminal emulators BiDi in terminal emulators

2019-02-07 Thread Egmont Koblinger via Unicode
Hi,

On Thu, Feb 7, 2019 at 3:27 PM Eli Zaretskii  wrote:

> It uses a regular expression, see term-prompt-regexp.

So, it's not automatic, needs user interaction, and for that reason,
may not have worked for me. (I have other weird things in my prompt,
like 256-color sequences that Emacs didn't recognize, perhaps this
made the regexp matching fail. Never mind.)

> > Whatever it does to know where the prompt is, can it be made into a
> > standard, cross-terminal feature?
>
> Not sure.  It's a kind of heuristic, which is why the regexp is
> customizable on user level, so that users could adapt it to their
> needs, should that be necessary.

iTerm2 has a "shell integration" where the prompt contains explicit
markers so that no heuristics or user configuration is needed from the
terminal. We're trying to somewhat standardize it at
https://gitlab.freedesktop.org/terminal-wg/specifications/issues/4 and
get more terminals support it. Not sure where this attempt will take
us, we'll see.
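
As a rough illustration of what explicit markers buy, here is a Python sketch that splits a byte stream (shown as a string) into prompt, input and output segments without any heuristics. The marker format used, ESC ] 133 ; A/B/C/D terminated by BEL or ESC \, is an assumption made for this sketch and should be checked against the iTerm2 documentation and the linked issue; the function name is hypothetical.

```python
import re

# Assumed marker format (not taken from this thread):
#   ESC ] 133 ; <A|B|C|D> [; params]  terminated by BEL or ESC \
MARKER = re.compile(r"\x1b\]133;([A-D])[^\x07\x1b]*(?:\x07|\x1b\\)")

# Assumed meaning of each marker letter.
KINDS = {"A": "prompt", "B": "input", "C": "output", "D": "other"}


def split_by_markers(stream):
    """Return a list of (kind, text) segments delimited by the markers."""
    segments = []
    kind = "other"   # text before the first marker is unclassified
    pos = 0
    for m in MARKER.finditer(stream):
        text = stream[pos:m.start()]
        if text:
            segments.append((kind, text))
        kind = KINDS[m.group(1)]
        pos = m.end()
    tail = stream[pos:]
    if tail:
        segments.append((kind, tail))
    return segments


if __name__ == "__main__":
    demo = ("\x1b]133;A\x07$ \x1b]133;B\x07cat f\n"
            "\x1b]133;C\x07hello\n\x1b]133;D;0\x07")
    print(split_by_markers(demo))
```

With segments classified this way, a terminal could treat every prompt boundary as a paragraph boundary for direction detection.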

> In what version of Emacs is that?  In the latest version 26 I have
> here, the tutorial displays with most paragraphs in RTL direction.

25.2 here; it may well have changed in a newer version, glad to hear it.

My distro will upgrade in about 2 months. Since I'm not an Emacs user
myself, I hope you don't mind if I don't make extra rounds in
upgrading now to verify this.


cheers,
egmont


Re: Bidi paragraph direction in terminal emulators BiDi in terminal emulators

2019-02-07 Thread Eli Zaretskii via Unicode
> Date: Thu, 7 Feb 2019 00:45:55 +0100
> Cc: unicode Unicode Discussion 
> From: Egmont Koblinger via Unicode 
> 
> > Not necessarily.  One could allow the first strong character in the
> > prompt to determine the paragraph directions
> 
> How does Emacs know what's a prompt? How can it tell it from the
> previous and next command's output?

It uses a regular expression, see term-prompt-regexp.

> Whatever it does to know where the prompt is, can it be made into a
> standard, cross-terminal feature?

Not sure.  It's a kind of heuristic, which is why the regexp is
customizable on user level, so that users could adapt it to their
needs, should that be necessary.

> > That's what the Emacs
> > terminal (invoked by M-x term; top level definition in term.el) does.
> 
> I tried it. Executed my default shell, and inside that, a "cat
> TUTORIAL.he". All the paragraphs are rendered as LTR ones,
> left-aligned. Not the way the file is opened in Emacs.

In what version of Emacs is that?  In the latest version 26 I have
here, the tutorial displays with most paragraphs in RTL direction.

> If you claim Emacs's built-in terminal emulator supports BiDi, I'm
> kindly asking you to present a documentation of its behavior, in
> similar spirit to my BiDi proposal.

The Emacs terminal emulator displays text as any other text in any
other Emacs buffer, so it supports the same bidi reordering as
elsewhere.  You could make it emulate other terminals by setting the
variable bidi-paragraph-direction to either left-to-right or
right-to-left, then all the paragraphs will have the base direction
you specify.  But the default value of this variable in term buffers
is nil, which invokes dynamic determination of base paragraph
direction.


Re: Bidi paragraph direction in terminal emulators BiDi in terminal emulators

2019-02-07 Thread Eli Zaretskii via Unicode
> Date: Wed, 6 Feb 2019 23:32:43 +
> From: Richard Wordingham via Unicode 
> 
> > You define paragraphs as emptyline-separated blocks on which you
> > perform autodetection of the paragraph direction. This is great! As
> > I've mentioned, I'd love to have such a mode in terminals, but it's
> > subject to underlying improvements, like knowing when a prompt starts
> > and ends, because prompts also have to be paragraph delimiters.
> 
> Not necessarily.  One could allow the first strong character in the
> prompt to determine the paragraph directions.  That's what the Emacs
> terminal (invoked by M-x term; top level definition in term.el) does.

Emacs's built-in terminal emulator does that only because no one
bothered to do something about this behavior.  I personally don't
consider this the correct behavior (but then I don't use M-x term in
Emacs except for testing).  Emacs does know where the prompt is, so it
could implement the rule that whatever follows the prompt starts a new
paragraph.


Re: Bidi paragraph direction in terminal emulators BiDi in terminal emulators

2019-02-07 Thread Richard Wordingham via Unicode
On Thu, 7 Feb 2019 00:45:55 +0100
Egmont Koblinger via Unicode  wrote:

> Hi Richard,
> 
> > Not necessarily.  One could allow the first strong character in the
> > prompt to determine the paragraph directions  
> 
> How does Emacs know what's a prompt? How can it tell it from the
> previous and next command's output?

I don't believe the Emacs terminal does either.  What's special about
the prompt is that it starts a line, so most paragraphs start with a
prompt.  Not all prompts contain a strong character.  To let a file's
contents control directionality, instead of issuing the command 'cat
file1' one would have to issue a shell command '(echo; cat file1)' or
similar to terminate the paragraph containing the prompt.  The 'echo'
inserts an empty line.
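
A small Python sketch of this effect, assuming paragraphs are delimited by empty lines and the direction comes from the first strong character. It is a simplification: the real rules (P2 and P3 in UAX#9) also skip text inside directional isolates, which is ignored here, and the function names are made up for illustration.

```python
import unicodedata


def first_strong_direction(text):
    """Return 'LTR', 'RTL', or None if the text has no strong character."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "LTR"
        if bidi_class in ("R", "AL"):
            return "RTL"
    return None


def block_directions(screen_text):
    """Detect one direction per empty-line-delimited block."""
    return [first_strong_direction(block) for block in screen_text.split("\n\n")]


if __name__ == "__main__":
    hebrew = "\u05e9\u05dc\u05d5\u05dd"
    # The prompt and the file contents share one paragraph: the prompt wins.
    print(block_directions("user$ cat file1\n" + hebrew + "\n"))
    # The extra empty line from 'echo' ends the prompt's paragraph.
    print(block_directions("user$ (echo; cat file1)\n\n" + hebrew + "\n"))
```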

> > That's what the Emacs
> > terminal (invoked by M-x term; top level definition in term.el)
> > does.  
> 
> I tried it. Executed my default shell, and inside that, a "cat
> TUTORIAL.he". All the paragraphs are rendered as LTR ones,
> left-aligned. Not the way the file is opened in Emacs.

See above.  I don't know what your shell is.

> If you claim Emacs's built-in terminal emulator supports BiDi, I'm
> kindly asking you to present a documentation of its behavior, in
> similar spirit to my BiDi proposal.

I've a feeling it has emergent behaviour, and may require a lot of
experimentation to elucidate.

> Does this logic also apply to single newline characters? If not, why
> not, what's the conceptual difference? If it does, why do text files
> end in a newline?

I don't like the convention that removing the newline from the end of a
non-empty line changes it into a binary file.  The short answer is that
some editors allow a text file not to have a final newline; such files
are not handled well in the Unix environment.

Some things are just untidy messes.  Compare C, where a semicolon
*terminates* statements, but some are terminated by '}', and a
semicolon *separates* the expression within the control part of a for
statement, and a comma *separates* the constant definitions in an enum
declaration - for a long time, a trailing comma inside the braces was
illegal.

Richard. 


Re: Bidi paragraph direction in terminal emulators BiDi in terminal emulators

2019-02-06 Thread Egmont Koblinger via Unicode
Hi Richard,

> Not necessarily.  One could allow the first strong character in the
> prompt to determine the paragraph directions

How does Emacs know what's a prompt? How can it tell it from the
previous and next command's output?

Whatever it does to know where the prompt is, can it be made into a
standard, cross-terminal feature?

> That's what the Emacs
> terminal (invoked by M-x term; top level definition in term.el) does.

I tried it. Executed my default shell, and inside that, a "cat
TUTORIAL.he". All the paragraphs are rendered as LTR ones,
left-aligned. Not the way the file is opened in Emacs.

If you claim Emacs's built-in terminal emulator supports BiDi, I'm
kindly asking you to present a documentation of its behavior, in
similar spirit to my BiDi proposal.

> Not necessarily.  One might use cat to glue together files that had
> split into 1400k chunks, in which case it is not even reasonable to
> expect the end of file to be at a character boundary.  (Yes, floppy
> disks still have their uses.)

I did not say anything about changing cat's behavior. I recommended to
change the convention for such paragraph-oriented text files to end
with two newlines.

> But the white space between paragraphs is a separator, not a
> terminator.  One doesn't require it at the end when formatting
> paragraphs within the cell of a table.

Does this logic also apply to single newline characters? If not, why
not, what's the conceptual difference? If it does, why do text files
end in a newline?


e.


Re: Bidi paragraph direction in terminal emulators BiDi in terminal emulators

2019-02-06 Thread Richard Wordingham via Unicode
On Wed, 6 Feb 2019 22:01:59 +0100
Egmont Koblinger via Unicode  wrote:

> Hi Eli,
> 
> (I'm getting lost where to reply, and how the subject gets mangled and
> the thread split into different ones.)
> 
> 
> I've thought about it a lot, experimented with Emacs's behavior, and
> I've arrived at the conclusion that we are actually much closer to
> each other than I had thought. Probably there's a lot of
> misunderstanding due to different terminology we used.
> 
> I've set my terminal to RTL paragraph direction (via the relevant
> escape sequence), then did a "cat TUTORIAL.he" (the file taken from
> 26.1), and compared to what I see in Emacs 25.2.2 – both the graphical
> one, and the one running in a terminal of no BiDi.
> 
> Apart from a few minor irrelevant differences, they look the same!
> Hooray!!!
> 
> (The differences are:
> 
> - I had to slightly modify TUTORIAL.he to make sure none of the lines
> start with a BiDi control (I added a preceding character) because
> currently VTE doesn't support them, there's no character cell to store
> this data. This definitely needs to be fixed in the second version of
> my proposal.
> 
> - Emacs running in a terminal shows an underscore wherever there's a
> BiDi control in the source file – while the graphical one doesn't.
> This looks like a simple bug to me, right?
> 
> - Line 1007, the copyright line of this file uses visual indentation,
> and Emacs detects LTR paragraph for that line. I think it should
> rather use BiDi controls to have an overall RTL paragraph direction
> detected, and within that BiDi controls to force LTR for the text. The
> terminal shows it with RTL direction, as I manually set it.
> 
> Again, all these three details are irrelevant to my point, namely that
> in WIP gnome-terminal it looks the same as in Emacs.)
> 
> 
> You define paragraphs as emptyline-separated blocks on which you
> perform autodetection of the paragraph direction. This is great! As
> I've mentioned, I'd love to have such a mode in terminals, but it's
> subject to underlying improvements, like knowing when a prompt starts
> and ends, because prompts also have to be paragraph delimiters.

Not necessarily.  One could allow the first strong character in the
prompt to determine the paragraph directions.  That's what the Emacs
terminal (invoked by M-x term; top level definition in term.el) does.

> On a nitpicking side note:
> 
> It's damn ugly not to terminate a text file with a newline. Newline is
> much better thought of as a "terminator" than a "delimiter". For example,
> if you do a "cat file1 file2", you expect file2 to start on its own
> line.

Not necessarily.  One might use cat to glue together files that had
split into 1400k chunks, in which case it is not even reasonable to
expect the end of file to be at a character boundary.  (Yes, floppy
disks still have their uses.)

> Shouldn't this apply to paragraphs, too, especially when BiDi is in
> the game? I'd argue that an empty line (double newline) shouldn't be a
> delimiter, it should be a terminator for a paragraph.

But the white space between paragraphs is a separator, not a
terminator.  One doesn't require it at the end when formatting
paragraphs within the cell of a table. 

Richard.



Re: Bidi paragraph direction in terminal emulators BiDi in terminal emulators

2019-02-06 Thread Egmont Koblinger via Unicode
Hi,

I was loose with my terminology once again, which is not a wise thing
when you're trying to clarify misunderstandings :)

> But once you have
> decided on a direction, each _line_ within that data is passed
> separately to the BiDi algorithm to get reshuffled; this is what Emacs
> does, this is what my specification says, and this is the right thing.
> That is, for this step, the definition of "paragraph", as the BiDi
> algorithm uses this term, is a line of the text file.

I keep thinking of the BiDi algorithm as one that takes a single
paragraph, because that's how I use it in VTE. But in fact, the BiDi
algorithm starts by splitting into paragraphs. I keep forgetting about
this outermost "for loop" of the BiDi algo.

And with that proper definition, you can of course pass the entire
emptyline-delimited segment into the BiDi algorithm in a single step.
In its first phase, the BiDi algorithm will split it at newlines,
because for the BiDi algorithm (but not when detecting the paragraph
direction in Emacs), newline is the paragraph delimiter. Then it will
execute the rest of the algorithm for each paragraph (that is: line)
separately.

This is exactly the same as splitting manually, and then for each line
invoking the BiDi algorithm.
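
The splitting phase can be sketched in Python as follows, assuming that every character of bidi class B ends a paragraph and that a CR LF pair counts as one separator. This is only the outer loop, not the reordering itself; the separator is dropped here for simplicity, whereas UAX#9 keeps it with the preceding paragraph. The function name is hypothetical.

```python
import unicodedata


def uba_paragraphs(text):
    """Split text into paragraphs at characters of bidi class B."""
    paragraphs = []
    current = []
    i = 0
    while i < len(text):
        ch = text[i]
        if unicodedata.bidirectional(ch) == "B":
            # Treat CR LF as a single separator.
            if ch == "\r" and i + 1 < len(text) and text[i + 1] == "\n":
                i += 1
            paragraphs.append("".join(current))
            current = []
        else:
            current.append(ch)
        i += 1
    if current:
        paragraphs.append("".join(current))
    return paragraphs


if __name__ == "__main__":
    print(uba_paragraphs("Hello\nWorld\n"))
    print(uba_paragraphs("a\r\nb\u0085c\u2029d"))
```

Each returned paragraph would then be handed to the rest of the algorithm on its own, which is the same as splitting at newlines by hand first.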


cheers,
egmont


Re: Bidi paragraph direction in terminal emulators BiDi in terminal emulators

2019-02-06 Thread Egmont Koblinger via Unicode
Hi Eli,

(I'm getting lost where to reply, and how the subject gets mangled and
the thread split into different ones.)


I've thought about it a lot, experimented with Emacs's behavior, and
I've arrived at the conclusion that we are actually much closer to
each other than I had thought. Probably there's a lot of
misunderstanding due to different terminology we used.

I've set my terminal to RTL paragraph direction (via the relevant
escape sequence), then did a "cat TUTORIAL.he" (the file taken from
26.1), and compared to what I see in Emacs 25.2.2 – both the graphical
one, and the one running in a terminal of no BiDi.

Apart from a few minor irrelevant differences, they look the same! Hooray!!!

(The differences are:

- I had to slightly modify TUTORIAL.he to make sure none of the lines
start with a BiDi control (I added a preceding character) because
currently VTE doesn't support them, there's no character cell to store
this data. This definitely needs to be fixed in the second version of
my proposal.

- Emacs running in a terminal shows an underscore wherever there's a
BiDi control in the source file – while the graphical one doesn't.
This looks like a simple bug to me, right?

- Line 1007, the copyright line of this file uses visual indentation,
and Emacs detects LTR paragraph for that line. I think it should
rather use BiDi controls to have an overall RTL paragraph direction
detected, and within that BiDi controls to force LTR for the text. The
terminal shows it with RTL direction, as I manually set it.

Again, all these three details are irrelevant to my point, namely that
in WIP gnome-terminal it looks the same as in Emacs.)


You define paragraphs as emptyline-separated blocks on which you
perform autodetection of the paragraph direction. This is great! As
I've mentioned, I'd love to have such a mode in terminals, but it's
subject to underlying improvements, like knowing when a prompt starts
and ends, because prompts also have to be paragraph delimiters. You
convinced me that it's much more important than I thought, thanks a
lot for that! I will try to see if I can push for addressing the
prerequisite issues sooner. Indeed I had to manually set RTL paragraph
direction; with manual LTR or with per-line autodetection (as VTE can
do now) the result would be much worse.


Here's how the story continues from here. Here is where we
misunderstood each other (or at the very least I misunderstood you),
although we are talking about the same, doing things the same way:

The BiDi algorithm takes a paragraph of text at a time, and somehow
reshuffles its letters. UAX#9 section 3 starts by saying that the
first main phase is separation into "paragraphs". What are those
"paragraphs" that we're talking about _now_?

The thing is, both in Emacs as well as in my specification, it's a
logical line of the text (that is: delimited by single newlines). No,
in these steps, when UBA is run, the paragraph is no longer defined as
emptyline-delimited segments, it's defined as lines of the text.

To recap: The _paragraph direction_ is determined in Emacs for
emptyline-delimited segments of data, which I honestly find a great
thing, and would love to do in terminals too, alas at this point it's
blocked by some really nontrivial technical issues. But once you have
decided on a direction, each _line_ within that data is passed
separately to the BiDi algorithm to get reshuffled; this is what Emacs
does, this is what my specification says, and this is the right thing.
That is, for this step, the definition of "paragraph", as the BiDi
algorithm uses this term, is a line of the text file. This is where I
thought we had a disagreement, but we don't, we just misunderstood
each other.
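
The two levels in this recap can be sketched in Python: the direction is chosen once per emptyline-delimited block, and then every line of that block is treated as its own paragraph with that direction. The sketch stops short of the actual reordering, uses a simplified first-strong rule that ignores isolates, and all names in it are hypothetical.

```python
import unicodedata


def first_strong_direction(text, default="LTR"):
    """Simplified first-strong detection; falls back to `default`."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "LTR"
        if bidi_class in ("R", "AL"):
            return "RTL"
    return default


def lines_with_direction(text):
    """Return (direction, line) pairs; direction is decided per block."""
    result = []
    block = []

    def flush():
        if block:
            direction = first_strong_direction("\n".join(block))
            result.extend((direction, line) for line in block)
            block.clear()

    for line in text.split("\n"):
        if line.strip() == "":
            flush()          # an empty line closes the current block
        else:
            block.append(line)
    flush()
    return result


if __name__ == "__main__":
    sample = "\u05d0\u05d1 abc\nabc \u05d2\u05d3\n\nhello\n\u05d4\u05d5\n"
    for direction, line in lines_with_direction(sample):
        print(direction, repr(line))
```

Each returned pair is what would be passed to the reordering step: one line of text plus the direction inherited from its block.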

-

On a nitpicking side note:

It's damn ugly not to terminate a text file with a newline. Newline is
much better thought of as a "terminator" than a "delimiter". For example,
if you do a "cat file1 file2", you expect file2 to start on its own
line.

Shouldn't this apply to paragraphs, too, especially when BiDi is in
the game? I'd argue that an empty line (double newline) shouldn't be a
delimiter, it should be a terminator for a paragraph. I think "cat
file1 file2" should make sure that the last paragraph of file1 and the
first paragraph of file2 are printed as separate paragraphs
(potentially with different paragraph direction), shouldn't it? I'd
argue that if a text file is formatted like TUTORIAL.he, with empty
lines denoting paragraph boundaries, then it should also end in an
empty line (that is: two newline characters).

-

Feel free to skip the rest :)

Let's make a thought experiment. Let's assume that for running the
BiDi algorithm, we'd still stick to the emptyline-delimited paragraph
definition. This is not what you do, this is not what I do, but I
misunderstood that this is what you did, and I also thought this was a
good idea as a potential extension for the BiDi specs – I no longer
think so. This definition is truly problematic, as I'll 

Re: Bidi paragraph direction in terminal emulators BiDi in terminal emulators

2019-02-05 Thread Eli Zaretskii via Unicode
> Date: Tue, 5 Feb 2019 00:05:47 +
> From: Richard Wordingham via Unicode 
> 
> > > Actually, UAX#9 defines "paragraph" as the chunk of text delimited
> > > by paragraph separator characters. This means characters whose bidi
> > > category is B, which includes Newline, the CR-LF pair on Windows,
> > > U+0085 NEL, and U+2029 PARAGRAPH SEPARATOR.
> 
> It actually gives two different definitions. Table 4 of UAX#9 restricts
> the type B to *appropriate* newline functions; not all newlines are
> paragraph separators.

For what exactly an "appropriate newline function" is, one should read the
Unicode Standard, section 5.8.  My conclusions from that are different
from yours; see below.

> > Indeed, this was an oversight on my side. So, with this definition,
> > every single newline character starts a new paragraph. The result of
> > printf "Hello\nWorld\n" > world.txt
> > is a text file consisting of two paragraphs, with 5 characters in
> > each. Correct?
> 
> No, it depends on when a newline function is 'appropriate'. TUS 5.8
> Rule R2b applies - 'In simple text editors, interpret any NLF the same
> as LS'.

That's not all of what the Standard says.  Just a couple of paragraphs
above Rule R2b, there's this text:

  Note that even if an implementer knows which characters represent
  NLF on a particular platform, CR, LF, CRLF, and NEL should be
  treated the same on input and in interpretation. Only on output is
  it necessary to distinguish between them.

So in practice, IMO the above example does constitute 2 paragraphs,
regardless of the underlying platform's conventions.


Re: Bidi paragraph direction in terminal emulators BiDi in terminal emulators)

2019-02-05 Thread Eli Zaretskii via Unicode
> From: Egmont Koblinger 
> Date: Tue, 5 Feb 2019 00:08:10 +0100
> Cc: unicode@unicode.org
> 
> every single newline character starts a new paragraph. The result of
> printf "Hello\nWorld\n" > world.txt
> is a text file consisting of two paragraphs, with 5 characters in each. 
> Correct?

Yes.

> > Actually, Emacs implements the rule that paragraphs are separated by
> > empty lines. This is documented in the Emacs manuals.
> 
> That is, Emacs overrides UAX#9 and comes up with a different
> definition?

Yes, Emacs uses the "higher-level protocols" clause in HL1, when the
paragraph direction is to be determined from the text.  (There's also
a way for the user or a Lisp program to force a certain base paragraph
direction on all paragraphs in a window that displays some text.)

> Furthermore, you argue that in terminals I should follow
> Emacs's definition rather than Unicode's?

IME, what Emacs uses gives much better results, yes.

> I believe I understand your concerns with the per-line paragraph
> definition, but this interpretation that I've just shown most likely
> leads to even more broken behavior.

I don't see how the result could be more broken, when the decisions
about base paragraph direction are made much more rarely.  The places
in text where the paragraph direction will be determined under my
proposal is a small subset of the places where it will be determined
by the default UBA rules.  So it will make the same mistakes as the
each-line-is-a-new-paragraph method, but there will be far fewer
such mistakes.

In addition to this theoretical argument, I have 10 years of using
this in Emacs to back me up.  The only difference between Emacs and
your example is the very first paragraph.

> It's a really nontrivial technical problem to let the terminal
> emulator know where each prompt, and/or each command's output begins
> and ends. There's work going on for letting the terminal emulator
> recognize the prompts, but even if it's successful, it'll probably
> take 5-10 years to reach the majority of the users. And it probably
> still wouldn't solve the case of knowing the boundary between the two
> outputs if a "cat file1.txt; cat file2.txt" is executed, let alone if
> they're concatenated with "cat file1.txt file2.txt".

I think you are trying to find a perfect solution, and because it
probably doesn't exist, or at least is hard to come by, you conclude
that a solution that is imperfect should be rejected.

But I'm not saying my proposal is the perfect solution, just that it
is better (sometimes, way better) than the default of considering each
line a paragraph.

> So, what you're arguing for, is that the default behavior should be
> something that's:
> - currently not implementable in a semantically correct way (to stop
> around shell prompts) due to technical limitations, and
> - isn't what Unicode says.

The first point has to do with the search for a perfect solution.  My
advice is to settle for something reasonable even if it is not
perfect.

The second point is incorrect: the UBA explicitly allows the
implementation to apply higher-level protocols for paragraph
direction, see HL1 in UAX#9.

> You have not convinced me that the pros outweigh the cons.

There are no cons in my proposal that aren't already present in the
default each-line-is-a-new-paragraph rule.  So even if the pros don't
outweigh the cons, the balance should be better than under the default.

> That being said, I'm more than open to see such a behavior as a
> future extension, subject of course to the semantic prompt stuff
> being available.

I think the default should provide reasonably good display, and
each-line-is-a-new-paragraph doesn't.


Re: Bidi paragraph direction in terminal emulators BiDi in terminal emulators)

2019-02-04 Thread Egmont Koblinger via Unicode
Hi Eli,

> IME, this is a grave mistake.  I hope I explained why; it is now up to
> you to decide what to do about that.

Let me share one more thought.

I have to admit, I'm not an Emacs user, I only have some vague ideas
how powerful a tool it is. But in its very core I still believe it's a
text editor – is it fair to say this? It could be used for example to
conveniently create TUTORIAL.he.

I'm not aware of all the kinds of work you can do in Emacs, but I
have a feeling that the kind of work you do in a terminal emulator is
potentially more diverse. (Let's not nitpick that a terminal can run
emacs and emacs has a terminal inside so mathematically speaking it's
all the same...)

"cat TUTORIAL.he" is indeed one of the commands you can execute in a
terminal, and unfortunately, given what terminals currently understand
from their contents, I just cannot make it display as you would prefer
(and I agree would make a lot of sense). But it's just one use case.

There are plenty of line-oriented tools.

Think of "head" and "tail". They operate on lines of files, which end
up being paragraphs in the terminal according to my definition.
According to your definition, they could cut a paragraph in half, they
could render differently than as if the entire file was printed.
According to my definition, you'll always get the same visual
representation, just on the given fragment of the file.
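
A Python sketch of the "tail" case, comparing per-line detection with detection per emptyline-delimited block. It only computes the detected direction for each line, using a simplified first-strong rule; the helper names are hypothetical.

```python
import unicodedata


def first_strong(text):
    """Return 'LTR', 'RTL', or None if there is no strong character."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "LTR"
        if bidi_class in ("R", "AL"):
            return "RTL"
    return None


def per_line(lines):
    """Every line is its own paragraph."""
    return [first_strong(line) for line in lines]


def per_block(lines):
    """Empty-line rule: one direction per block, applied to all its lines."""
    out = []
    start = 0
    for i in range(len(lines) + 1):
        if i == len(lines) or lines[i] == "":
            direction = first_strong("\n".join(lines[start:i]))
            out.extend([direction] * (i - start))
            if i < len(lines):
                out.append(None)   # the empty line itself
            start = i + 1
    return out


if __name__ == "__main__":
    whole = ["abc \u05d0", "\u05d1\u05d2 def"]
    tail = whole[-1:]              # what "tail -n 1" would print
    print(per_line(whole), per_line(tail))
    print(per_block(whole), per_block(tail))
```

In this example the last line keeps its direction under per-line detection whether or not the first line is shown, while under the empty-line rule it flips from LTR to RTL once the first line is cut away.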

Think of "grep", possibly combined with "-r" to process files
recursively, and "-C" to print context lines. Not only can it cut
paragraphs (of your definition) in half when it displays the matching
line (plus context), but also how would you locate in its output when
it switches from one match's context to the next match's context
within the same file, or to a match in another file? How would you
define a paragraph, and how would you define the bigger unit on which
the paragraph direction is guessed? I think it's again a use case
where my definition of paragraph is less problematic than yours.

Think of ad-hoc shell scripts that use "echo"/"printf" to inform the
user, "read" to read data etc. Or utilities written in C or whatever
that don't care about terminals at all, just print output. In these
cases there's no one formatting / wrapping at 80 columns performed by
the app. A logical segment is typically printed as a single line,
which will be wrapped by the terminal if it doesn't fit in the current
width (and in some terminals rewrapped when the terminal is resized),
this matches my definition of paragraph. There's rarely an empty line
injected in these cases; if there is, it is most likely to separate
some even bigger semantical units.

There are just sooo many use cases, it's impossible to perfectly
address all of them at once. "cat TUTORIAL.he" is just one of them,
not necessarily the most typical, not necessarily the one that should
drive the BiDi design.

Let's note that the four "BiDi-aware" terminals that I could test all
define paragraphs as lines – I mean visual lines on their own canvas.
If the terminal is 80 characters wide, and a utility prints a line of
100 characters, it'll obviously wrap into 80+20 characters. And then
these terminals treat them as two separate paragraphs, one with 80
characters and one with 20, and run BiDi separately on them. I'm
confident that my specification which says that it should be preserved
as a 100 character long paragraph and passed to BiDi accordingly is
already a significant step forward.
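
The 80+20 case can be sketched in Python. The example assumes one cell per character (no wide or combining characters), a simplified first-strong rule for direction detection, and a made-up 100-character line of 80 Hebrew letters followed by 20 Latin ones; the helper names are hypothetical.

```python
import unicodedata


def first_strong(text):
    """Return 'LTR', 'RTL', or None if there is no strong character."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "LTR"
        if bidi_class in ("R", "AL"):
            return "RTL"
    return None


def wrap(line, width):
    """Cut a logical line into visual rows of at most `width` characters."""
    return [line[i:i + width] for i in range(0, len(line), width)]


# A logical line of 100 characters on an 80-column terminal.
logical = "\u05d0" * 80 + "a" * 20
rows = wrap(logical, 80)

if __name__ == "__main__":
    # Visual rows treated as separate paragraphs: directions may differ.
    print([first_strong(row) for row in rows])
    # The whole logical line treated as one paragraph: a single direction.
    print(first_strong(logical))
```

Treating the rows as separate paragraphs gives the second row a different direction from the first, whereas keeping the 100 characters together yields one consistent direction for the whole line.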


cheers,
egmont



Re: Bidi paragraph direction in terminal emulators BiDi in terminal emulators)

2019-02-04 Thread Richard Wordingham via Unicode
On Tue, 5 Feb 2019 00:08:10 +0100
Egmont Koblinger via Unicode  wrote:

> Hi Eli,
> 
> > Actually, UAX#9 defines "paragraph" as the chunk of text delimited
> > by paragraph separator characters.  This means characters whose bidi
> > category is B, which includes Newline, the CR-LF pair on Windows,
> > U+0085 NEL, and U+2029 PARAGRAPH SEPARATOR.  

It actually gives two different definitions.  Table 4 of UAX#9 restricts
the type B to *appropriate* newline functions; not all newlines are
paragraph separators.

> Indeed, this was an oversight on my side. So, with this definition,
> every single newline character starts a new paragraph. The result of
> printf "Hello\nWorld\n" > world.txt
> is a text file consisting of two paragraphs, with 5 characters in
> each. Correct?

No, it depends on when a newline function is 'appropriate'.  TUS 5.8
Rule R2b applies - 'In simple text editors, interpret any NLF the same
as LS'.

> > Actually, Emacs implements the rule that paragraphs are separated by
> > empty lines.  This is documented in the Emacs manuals.  
> 
> That is, Emacs overrides UAX#9 and comes up with a different
> definition? Furthermore, you argue that in terminals I should follow
> Emacs's definition rather than Unicode's? Or please clarify if I
> misunderstood you here.

He's deriving 'B' from a protocol.

Richard.


Re: Bidi paragraph direction in terminal emulators BiDi in terminal emulators

2019-02-04 Thread Egmont Koblinger via Unicode
Hi Eli,

> Actually, UAX#9 defines "paragraph" as the chunk of text delimited by
> paragraph separator characters.  This means characters whose bidi
> category is B, which includes Newline, the CR-LF pair on Windows,
> U+0085 NEL, and U+2029 PARAGRAPH SEPARATOR.

Indeed, this was an oversight on my side. So, with this definition,
every single newline character starts a new paragraph. The result of
printf "Hello\nWorld\n" > world.txt
is a text file consisting of two paragraphs, with 5 characters in each. Correct?
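
A minimal sketch of that strict reading, assuming Python's unicodedata module as the source of the Bidi_Class property. The function name split_on_class_b is made up for this example, and the sketch deliberately ignores any higher-level protocol that might decide a newline is not a paragraph separator.

```python
import unicodedata


def split_on_class_b(text):
    """Split text at every character whose Bidi_Class is B."""
    paragraphs, current = [], []
    i = 0
    while i < len(text):
        ch = text[i]
        if unicodedata.bidirectional(ch) == "B":
            # Treat a CR LF pair as one separator rather than two.
            if ch == "\r" and text[i + 1:i + 2] == "\n":
                i += 1
            paragraphs.append("".join(current))
            current = []
        else:
            current.append(ch)
        i += 1
    if current:
        paragraphs.append("".join(current))
    return paragraphs


print(split_on_class_b("Hello\nWorld\n"))  # ['Hello', 'World']
```

Under this rule an empty line simply becomes an empty paragraph of its own.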

> Actually, Emacs implements the rule that paragraphs are separated by
> empty lines.  This is documented in the Emacs manuals.

That is, Emacs overrides UAX#9 and comes up with a different
definition? Furthermore, you argue that in terminals I should follow
Emacs's definition rather than Unicode's? Or please clarify if I
misunderstood you here.

> > while Emacs itself is a viewer that treats runs between single
> > newlines as paragraphs. That is, Emacs is inconsistent with itself.
>
> Incorrect.  Emacs always treats a run of text between empty lines as a
> single paragraph, in TUTORIAL.he and everywhere else.  There's nothing
> special about TUTORIAL.he, it is just a plain text file with a few
> dozen bidi formatting controls, needed to show the key sequences
> with weak and neutral characters in correct visual order.  [...]

Thanks for the clarification, I believe it's clear to me now.

> At least with Emacs, it is not the same.  I think considering each
> line as a separate paragraph makes writing bidi plain-text documents
> that look right almost impossible, if each line ends in a newline [...]

> My personal recommendation is to adopt the empty line rule.  It's
> simple enough and gives good results IME. [...]

> I'm surprised that you describe this as such a complex problem.  I
> think you explained up-thread that terminal emulators should cope with
> lines of text arriving piecemeal, which I interpreted as meaning that
> text is stored in the emulator's memory.  Modern emulators running on
> windowed desktops also provide scroll-back buffers, and react to
> expose events.  So I think the text that is currently in the viewport,
> and also some text previously shown, are stored in memory, and can be
> consulted.

The problem is not the memory management.

Let's look at the following session:

---snip---
prompt$ cat file1.txt
This is the
first human-perceived paragraph.

And this is the
second.
prompt$ cat file2.txt
Here this is the
third paragraph.

And this one is
the fourth.
prompt$
---snip---

If you load the files to Emacs, it is perfectly aware of the contents
of the two files. It can define paragraphs however it wants to, and
BiDi the files accordingly.

The terminal emulator doesn't know what's a shell prompt, what's a
command that the user types, what's the output of that command. (You
don't know either from this snippet. Maybe I only cat'ed file1.txt,
and "prompt$ cat file2.txt" is just the sixth line of this eleven-line
file.)

In the terminal emulator's eyes, with Emacs's definition (empty line
delimited), this is one paragraph:

prompt$ cat file1.txt
This is the
first human-perceived paragraph.

and this is another paragraph:

And this is the
second.
prompt$ cat file2.txt
Here this is the
third paragraph.

and similarly for the third one.

I believe I understand your concerns with the per-line paragraph
definition, but this interpretation that I've just shown most likely
leads to even more broken behavior.
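
The grouping can be reproduced with a minimal sketch. The function name empty_line_paragraphs is made up for this example, and treating whitespace-only lines as empty is an assumption of the sketch, not a statement about what Emacs does.

```python
SESSION = """\
prompt$ cat file1.txt
This is the
first human-perceived paragraph.

And this is the
second.
prompt$ cat file2.txt
Here this is the
third paragraph.

And this one is
the fourth.
prompt$"""


def empty_line_paragraphs(text):
    """Group lines into paragraphs delimited by empty lines."""
    paragraphs, current = [], []
    for line in text.split("\n"):
        if line.strip() == "":
            if current:
                paragraphs.append(current)
            current = []
        else:
            current.append(line)
    if current:
        paragraphs.append(current)
    return paragraphs


for number, paragraph in enumerate(empty_line_paragraphs(SESSION), start=1):
    print(number, paragraph)
```

This yields three paragraphs. The second one contains the end of file1.txt, the next prompt line, and the beginning of file2.txt.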

It's a really nontrivial technical problem to let the terminal
emulator know where each prompt, and/or each command's output begins
and ends. There's work going on for letting the terminal emulator
recognize the prompts, but even if it's successful, it'll probably
take 5-10 years to reach the majority of the users. And it probably
still wouldn't solve the case of knowing the boundary between the two
outputs if a "cat file1.txt; cat file2.txt" is executed, let alone if
they're concatenated with "cat file1.txt file2.txt".

So, what you're arguing for, is that the default behavior should be
something that's:
- currently not implementable in a semantically correct way (to stop
around shell prompts) due to technical limitations, and
- isn't what Unicode says.

You have not convinced me that the pros outweigh the cons. That being
said, I'm more than open to see such a behavior as a future extension,
subject of course to the semantic prompt stuff being available.


cheers,
egmont


Re: Bidi paragraph direction in terminal emulators BiDi in terminal emulators

2019-02-04 Thread Eli Zaretskii via Unicode
> From: Egmont Koblinger 
> Date: Mon, 4 Feb 2019 00:36:23 +0100
> Cc: unicode@unicode.org
> 
> The Unicode BiDi algorithm states that it operates on paragraphs of
> text, and leaves it up to a higher protocol to define what a paragraph
> exactly is.
> 
> What's the definition of "paragraph" in the context of plain text files?
> 
> I don't think there's a single well-established practice.

Actually, UAX#9 defines "paragraph" as the chunk of text delimited by
paragraph separator characters.  This means characters whose bidi
category is B, which includes Newline, the CR-LF pair on Windows,
U+0085 NEL, and U+2029 PARAGRAPH SEPARATOR.

> In some, e.g. in Emacs's TUTORIAL.he, or markdown files, it's way
> more complicated, probably there isn't a well-defined grammar for
> how exactly bullet list entries and alike should become new
> paragraphs.

Actually, Emacs implements the rule that paragraphs are separated by
empty lines.  This is documented in the Emacs manuals.  (That's by
default, users and Lisp programs can control that to some extent.)
This rule is global, and applied to any file or buffer, including
TUTORIAL.he.

> lorem ipsum FED ]> CBA foobar
> 
> The visual representation, in a narrower viewport, might wrap for
> example like this:
> 
> lorem ipsum CBA
> FED ]> foobar

I suggest leaving line wrapping alone for the moment: it is a further
complication.  Let's first talk about text whose every line ends in a
hard newline -- this is what you see in most "simple" text-mode
utilities which we are talking about.  If/when we solve the problems
there, we can then look at the issues with wrapping.

> Here comes the twist. Let's view this latter file with a viewer that
> uses a _different_ definition for paragraph. Let's view it in Gedit,
> Emacs, or the work-in-progress BiDi-aware VTE by "cat"ing it, where
> every newline begins a new paragraph – that's how these viewers define
> the notion of "paragraph" for the sake of BiDi.
> 
> The visual layout in these viewers becomes:
> 
> lorem ipsum CBA
> <[ FED foobar
> 
> which is just not correct. Since here BiDi is run on the two lines
> separately, the initial "<[" is treated as LTR, placed at the wrong
> location in the wrong order, and the glyphs aren't mirrored.

This kind of problem happens all the time, and you cannot avoid it.
Different programs display bidi text differently.  I propose not to
try to solve this problem, because IME it cannot be solved in general.
Let's focus on the terminal emulators that should comply with your
guidelines, and let's try to decide what they should do about base
paragraph direction of text emitted by "simple" text utilities.
If they all make decisions by the same rule, they all will show the
same text identically.
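
As one example of such a shared rule, here is a minimal sketch of a first-strong heuristic in the spirit of UAX#9 rules P2 and P3. It is a simplification: it does not skip characters inside isolates, the function name and the default argument are made up for this example, and it is not necessarily the rule anyone in this thread is proposing.

```python
import unicodedata


def first_strong_direction(paragraph, default="ltr"):
    """Guess the base direction from the first strong character."""
    for ch in paragraph:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "ltr"
        if bidi_class in ("R", "AL"):
            return "rtl"
    # No strong character at all (digits, punctuation, spaces only).
    return default


print(first_strong_direction("abc \u05d0\u05d1"))  # ltr
print(first_strong_direction("\u05d0\u05d1 abc"))  # rtl
print(first_strong_direction("123 >>"))            # ltr (the default)
```

Two emulators that apply the same function to the same paragraph get the same base direction, whatever the paragraph definition turns out to be.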

> Now, Emacs ships a TUTORIAL.he which, for most of its contents (but
> not everywhere) seems to treat runs between empty lines as paragraphs,

Correct.

> while Emacs itself is a viewer that treats runs between single
> newlines as paragraphs. That is, Emacs is inconsistent with itself.

Incorrect.  Emacs always treats a run of text between empty lines as a
single paragraph, in TUTORIAL.he and everywhere else.  There's nothing
special about TUTORIAL.he, it is just a plain text file with a few
dozen bidi formatting controls, needed to show the key sequences
with weak and neutral characters in correct visual order.  (Some of
those controls can probably be removed nowadays, since we now have the
BPA of Unicode 6.3 -- the file was written before Unicode 6.3 was
released.)  In fact, I wrote that tutorial as an exercise, to prove to
myself that Emacs can be useful for editing non-trivial bidi text.

> In case you think I got something wrong with Emacs: Could you please
> give exact definitions:
> - What are the exact units (so-called "paragraphs" by UAX9) that it
> runs BiDi on when it loads and displays a file?

See above: for the purpose of the Emacs UBA implementation, paragraphs
are separated by empty lines.  That is the only rule in Emacs
regarding paragraph determination.

> - What are the exact units (so-called "paragraphs" by UAX9) in
> TUTORIAL.he on which BiDi needs to be run in order to get the desired
> readable version?

The same.  There's nothing special about that file.

> What most likely happens is that in order to see a difference, you'd
> need to have more special symbols, or at least a more special
> constellation of them. Probably TUTORIAL.he is just luckily simple
> enough that such a difference isn't hit.

No, TUTORIAL.he is neither "lucky" nor "simple".  I deliberately used
there almost every bidi formatting control there is, where
appropriate, to make sure this stuff works as intended in an otherwise
plain text file.

> Another possibility is (and I cannot check because I can't speak
> Hebrew) that somewhere TUTORIAL.he "cheats" with the logical order to
> get the desired visual one.

There's no cheating there, I assure you.

> This definition of paragraph (stuff between a newline and