RE: Unicode editing

Marco Cimarosti Wed, 11 Apr 2001 08:45:48 -0700
(A note: I am starting to wonder whether this discussion is appropriate on
this mailing list. Or whether we should prepend an "off topic" prefix...)

Roozbeh Pournader [mailto:[EMAIL PROTECTED]] wrote:
> On Wed, 28 Mar 2001, Marco Cimarosti wrote:
> > Storing the level with each character is enough for 
> > generating *one* valid Unicode logical order.
> > [...]
> > But it is *not* enough to recreate *exactly* the same 
> > embedding controls that you had in the original text.
> 
> I know it will be enough for that. I'm just wondering about
> additional bidi requirements (weak types, etc). We may
> need to have more info to make life easier for the user,
> including outside the buffer information, like "what is
> the user doing currently?", including 'inside a number',
> 'inside a math-like expression', 'just finished an LTR
> embedding', etc. Also, we may need to know which has
> been the character just inserted.

Yes, I agree on this. Especially for typing numbers in RTL text, which is a
particularly troublesome case.

Unfortunately, this defeats my naive idea that typed characters must always go
at the cursor position...

But, what do you mean exactly by "math-like expression"?

> You know what I'm trying to do. I want to make it easier for 
> the user at all costs... I think we agree on the basic
> model (visual + emb. levels).

OK. I'll try to recapitulate my vision of this model, to see if we have the
same thing in mind.

1) The standard bidi algorithm is run to get a visual-order line decorated
with embedding levels for each character. (BTW, all embedding and override
controls are stripped off in this phase, by virtue of rule X9).

2) [OPTIONAL] The embedding levels generated in point 1 are optimized: they
are lowered to the minimum possible values that would not affect the logical
order. [This is optional because of your objection that embedding levels
which are "redundant" or "nonsense" today might have new unexpected
meanings in future versions of Unicode]

3) The embedding levels are normally invisible to the user, and she doesn't
normally need to care about them. However, there is a special editing mode where
the embedding levels may be visualized (as arrows?) and edited.

4) Text selections, cursor movements and mouse hit-testing are done in a
strictly visual way, so they are more or less as easy as in non-bidi
editing.

5) Copying text to the clipboard requires running point 2 (optimizing
levels) on the clipboard text. I.e., the clipboard is treated exactly as if
it was a paragraph.

6) [OPTIONAL] Deleting parts of text requires re-running point 2 (optimizing
levels) on the remaining paragraph.

7) Pasting text requires raising the levels in the pasted block to be equal
to or 1 greater than those of the surrounding text, so that it fits nicely
in its new embedding.

8) [OPTIONAL] Pasting text requires re-running point 2 (optimizing levels)
on the resulting paragraph.

9) The paragraph direction of text being pasted determines the position of
the cursor after pasting. If the clipboard text is RTL, the cursor will be
moved to the LEFT of the new text. If the clipboard is LTR, the cursor will
be moved to the RIGHT of the new text.

10) Characters entered from the keyboard have only two directional types:
RTL or LTR. There are neither weak nor neutral types. In other words, each
character associated with a key is a sort of "mini paragraph", containing
only one character, which has already undergone points 1 and 2.
With this trick, keyboard entry and clipboard pasting can be treated the
same way. [I am not sure that you agree with this: there was a doubt related
to the space key on Hebrew keyboards]

11) As keyboard entry is similar to a series of clipboard pastings, and as
there are no neutral types, each entered character is always inserted at the
cursor's position. An important consequence of this identification between
pasting and typing is that point 9 (where the cursor goes after
inserting) also holds true for keyboard entry. BUT this idea is clearly too
naive, as you explained above.

12) There are special RTL and LTR override commands. These commands have two
modes, similarly to the bold or italic functions in word processors. If they
are given on a selected block, the embedding level of that block is reset to
an odd or even level which is equal to or 1 greater than the surrounding
text. If they are given with no selected block, the command is executed on
the "mini paragraphs" of the keyboard, so that all characters entered from
that moment on will have a specified direction.

13) When the editing session ends (or, during editing, when the text is
needed back in logical order for some non-visual operation), a "Reverse Bidi
Algorithm" is run to restore a proper Unicode string. This is clearly one of
the vulnerable points of all our discussion, because it is not so easy to
come up with a good algorithm to do this. It is straightforward to reverse
rules P* and L* (reordering levels and paragraph direction), but reversing
rules X* and I* (explicit and implicit embeddings) is more complicated, and
reversing rules W* and N* (weak and neutral types) is really quite tricky
for me...
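
To illustrate the "straightforward" part of point 13, here is a minimal
Python sketch of rule L2 (reversing runs by level) and of its inverse. It
assumes that a line is a list of (character, level) pairs, that the levels
travel with their characters, and that the whole list is a single line. The
names reverse_runs, logical_to_visual and visual_to_logical are made up for
this example. It says nothing about reversing rules X*, I*, W* or N*.

```python
# Sketch only: cells are (char, level) pairs; all names are illustrative.

def reverse_runs(cells, k):
    """Reverse each maximal run of cells whose level is >= k."""
    out = list(cells)
    i = 0
    while i < len(out):
        if out[i][1] < k:
            i += 1
            continue
        j = i
        while j < len(out) and out[j][1] >= k:
            j += 1
        out[i:j] = out[i:j][::-1]
        i = j
    return out

def level_range(cells):
    """Return (lowest odd level, highest level), or None if no odd level."""
    levels = [lv for _, lv in cells]
    odd = [lv for lv in levels if lv % 2 == 1]
    if not odd:
        return None
    return min(odd), max(levels)

def logical_to_visual(cells):
    """Rule L2: reverse runs from the highest level down to the lowest odd."""
    rng = level_range(cells)
    if rng is None:
        return list(cells)
    lowest_odd, highest = rng
    for k in range(highest, lowest_odd - 1, -1):
        cells = reverse_runs(cells, k)
    return cells

def visual_to_logical(cells):
    """Undo L2: apply the same reversals in the opposite order."""
    rng = level_range(cells)
    if rng is None:
        return list(cells)
    lowest_odd, highest = rng
    for k in range(lowest_odd, highest + 1):
        cells = reverse_runs(cells, k)
    return cells

# Capital letters stand for RTL characters; digits sit at level 2.
logical = list(zip("abCD12Ez", [0, 0, 1, 1, 2, 2, 1, 0]))
visual = logical_to_visual(logical)
print("".join(ch for ch, _ in visual))   # abE12DCz
```

The inverse works because each reversal step leaves a run in the same
positions it occupied before, so every step undoes itself; only the order of
the steps has to be flipped.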

Now, I am very fond of points 9 and 11 (where the cursor goes after
inserting). But, clearly, I have been oversimplifying one more time :-(

One effect of these rules is that, when (LTR) numbers are typed in an RTL
context, the user has to move the cursor after each number, in order to
bring the cursor to the left of the number, before she can type the next RTL
word.

Similarly, in either RTL or LTR, when a word of "opposite" directionality is
entered, the user must move the cursor back to a proper position before
typing the next "straight" word.
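
To make the problem concrete, here is a small Python sketch of the naive
model of points 9 and 11. It assumes a visual-order buffer held as a list of
characters and a cursor that is an index between characters; type_key and
type_all are hypothetical names. Capital letters stand for RTL characters.

```python
# Sketch only: a visual-order buffer with the naive cursor rule.

def type_key(buf, cursor, ch, direction):
    """Insert ch at the cursor and return the new cursor position.

    LTR characters leave the cursor on their right,
    RTL characters leave it on their left (point 9).
    """
    buf.insert(cursor, ch)
    return cursor + 1 if direction == "LTR" else cursor

def type_all(keys):
    """Type a sequence of (char, direction) keys into an empty buffer."""
    buf, cursor = [], 0
    for ch, direction in keys:
        cursor = type_key(buf, cursor, ch, direction)
    return "".join(buf), cursor

word = [("A", "RTL"), ("B", "RTL"), ("C", "RTL")]
number = [("1", "LTR"), ("2", "LTR")]

print(type_all(word + number))                  # ('12CBA', 2)
print(type_all(word + number + [("D", "RTL")])) # ('12DCBA', 2)
```

After the number, the cursor sits between "12" and "CBA", so the next RTL
letter lands in the wrong place: the wanted display is "D12CBA", which the
user only gets by first moving the cursor to the left of "12".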

I was wondering whether, perhaps, the general model above could be left
intact, and this problem could be addressed with a sort of "patch"?

What I have in mind is an optional function to automatically detect these
cases and "move the cursor back" on behalf of the user.

As you said, this requires keeping track of the user's activity. What I am
hoping is that this status information could be limited to the
directionality of the last entered character.

This is a first draft of how this patch function might work:

.       variable:       smartBidiMode   (boolean; if false, the user prefers the original model)
.       variable:       lastKeyDir      (it can be "LTR", "RTL" or "N/A")
.       variable:       currKeyDir      (it can be "LTR", "RTL" or "N/A")
.       variable:       paragraphDir    (it can be "LTR" or "RTL")
.       set lastKeyDir = "N/A"
.       for each typed key:
.       .       if it is a graphic key then:
.       .       .       set currKeyDir = the direction of the character that will be entered
.       .       else:
.       .       .       set currKeyDir = "N/A"
.       .       end if
.       .       if smartBidiMode and currKeyDir = paragraphDir and currKeyDir <> lastKeyDir then:
.       .       .       if paragraphDir = "LTR" then:
.       .       .       .       while there is a char on the right of the cursor and it is "RTL":
.       .       .       .       .       move one position to the right
.       .       .       .       end while
.       .       .       else if paragraphDir = "RTL" then:
.       .       .       .       while there is a char on the left of the cursor and it is "LTR":
.       .       .       .       .       move one position to the left
.       .       .       .       end while
.       .       .       end if
.       .       end if
.       .       set lastKeyDir = currKeyDir
.       .       go on as usual... (insert the character or perform the function key)
.       end for
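
The draft above could be rendered in Python roughly as follows. This is only
a sketch under a few assumptions: the buffer is a visual-order list of
(character, direction) pairs, the cursor is an index between cells, and
function keys do nothing except reset the remembered direction. The Editor
class and its method names are invented for the example. Capital letters
stand for RTL characters.

```python
from dataclasses import dataclass, field

@dataclass
class Editor:
    paragraph_dir: str                 # "LTR" or "RTL"
    smart_bidi_mode: bool = True       # False: keep the original model
    cells: list = field(default_factory=list)  # visual order, (char, dir)
    cursor: int = 0                    # index between cells
    last_key_dir: str = "N/A"          # "LTR", "RTL" or "N/A"

    def _skip_opposite_run(self):
        """Move the cursor past the adjacent run of opposite direction."""
        if self.paragraph_dir == "LTR":
            while (self.cursor < len(self.cells)
                   and self.cells[self.cursor][1] == "RTL"):
                self.cursor += 1
        else:
            while (self.cursor > 0
                   and self.cells[self.cursor - 1][1] == "LTR"):
                self.cursor -= 1

    def type_key(self, ch, key_dir):
        """Handle one key; key_dir is "N/A" for non-graphic keys."""
        if (self.smart_bidi_mode
                and key_dir == self.paragraph_dir
                and key_dir != self.last_key_dir):
            self._skip_opposite_run()
        self.last_key_dir = key_dir
        if key_dir == "N/A":
            return                     # function keys are not modelled here
        # "Go on as usual": insert at the cursor, then apply point 9.
        self.cells.insert(self.cursor, (ch, key_dir))
        if key_dir == "LTR":
            self.cursor += 1

    def text(self):
        return "".join(ch for ch, _ in self.cells)

keys = [("A", "RTL"), ("B", "RTL"), ("C", "RTL"),
        ("1", "LTR"), ("2", "LTR"), ("D", "RTL")]

smart = Editor("RTL")
for ch, d in keys:
    smart.type_key(ch, d)
print(smart.text())    # D12CBA

naive = Editor("RTL", smart_bidi_mode=False)
for ch, d in keys:
    naive.type_key(ch, d)
print(naive.text())    # 12DCBA
```

Note that, exactly as in the draft, any non-graphic key resets the remembered
direction, so the cursor may also jump when typing resumes next to an
opposite-direction run after a manual cursor movement.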

> > For instance, imagine that the original text contained a 
> > stand-alone <PDF>. [...]
> It's OK. I'm almost sure that we should start working on a canonical
> equivalence thing for Bidi.

I have already mentioned rule X9 in the bidi algorithm ("Remove all RLE,
LRE, RLO, LRO, PDF, and BN codes"): isn't it enough for getting rid of such
weird cases and being happily compliant?
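
For what it's worth, the removal described by X9 can be sketched in a few
lines of Python, assuming that the bidi classes reported by the standard
unicodedata module are good enough for the purpose. The function name
strip_x9 is made up for this example.

```python
import unicodedata

# Bidi classes named in rule X9.
X9_REMOVED = {"RLE", "LRE", "RLO", "LRO", "PDF", "BN"}

def strip_x9(text):
    """Drop every character whose bidi class is listed in rule X9."""
    return "".join(
        ch for ch in text
        if unicodedata.bidirectional(ch) not in X9_REMOVED
    )

# A stand-alone PDF (U+202C) simply disappears.
print(strip_x9("ab\u202ccd"))   # abcd
```

The stand-alone <PDF> case then needs no special treatment: it is gone before
any levels are looked at.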

_ Marco