Re: Acquiring DIS 10646

2015-10-04 Thread Sean Leonard

On 10/3/2015 12:28 PM, Asmus Freytag (t) wrote:

On 10/3/2015 8:15 AM, Sean Leonard wrote:

Thanks.

Well, "DIS 10646" is the Draft International Standard, particularly 
Draft 1, from ~1990 or ~1991. (Sometimes it might have been called 
10646.1.) Therefore it would likely only be in print form (or printed 
and scanned form). It's pretty old. What I understand is that Draft 1 
got shot down because it was at variance with the nascent Unicode 
effort; Draft 2 was eventually adopted as ISO 10646:1993, and is 
equivalent to Unicode 1.1. (10646-1:1993 plus Amendments 5 to 7 = 
Unicode 2.0.)


Sean,

you never explained your specific interest in this matter. Personal 
curiosity? An attempt to write the definitive history of character encoding?


A long time ago, in a galaxy far, far away

(Okay it really was not that long ago, and it was pretty close at hand 
since it was on this list)


I proposed adding C1 Control Pictures to Unicode. I am resurrecting that
effort, but more slowly this time, with more research and input from
implementers. The requirement is that all glyphs for U+0000 - U+00FF be
graphically distinct.


Debuggers used to do this by referencing the graphemes in the hardware 
code page, such as Code Page 437, but we have come a long way from 1981, 
so displaying ♣ for 0x05 does not make much modern sense. Merely 
substituting one of the other legacy code pages in for 0x80 - 0x9F does 
not make sense either. The characters of Code Page 437 overlap with 
U+00A0 - U+00FF in that range, for example. (Windows-1252 is somewhat 
more defensible, but Windows-1252 has 5 unassigned code points so it 
would be incomplete.)
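
As a rough illustration of the distinctness requirement, here is a minimal Python sketch. It relies on the existing Control Pictures block (U+2400 to U+241F for the C0 controls, U+2420 for space, U+2421 for DEL). The bracketed hex labels used for the C1 range, NBSP and SHY are placeholders invented for this sketch, not proposed glyphs, and `label` is a hypothetical helper name.

```python
def label(cp):
    """Return a visible label for a code point in U+0000..U+00FF."""
    if cp < 0x20:
        return chr(0x2400 + cp)   # existing Control Pictures for C0 controls
    if cp == 0x20:
        return "\u2420"           # SYMBOL FOR SPACE
    if cp == 0x7F:
        return "\u2421"           # SYMBOL FOR DELETE
    if 0x80 <= cp <= 0x9F or cp in (0xA0, 0xAD):
        return "[%02X]" % cp      # placeholder only: no encoded picture exists
    return chr(cp)                # ordinary graphic character, shown as itself

# One label per code point; the requirement is that no two are the same.
labels = [label(cp) for cp in range(0x100)]
```

The check `len(set(labels)) == 256` only proves that the label strings differ; whether the rendered glyphs are visually distinct still depends on the font.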


Sean




Deleting Lone Surrogates

2015-10-04 Thread Richard Wordingham
In the absence of a specific tailoring, is the combination of a lone
surrogate and a combining mark a user-perceived character?  Does a lone
surrogate constitute a user-perceived character?

The problem I have is that because of an application-specific bug,
when I attempt to enter the sequence <U+1148F, U+114BA TIRHUTA SIGN
CANDRABINDU>, I appear to be getting the UTF-16 code unit sequence
<D805 DC8F D805 D805 DCBA>, which is being interpreted as the codepoint
sequence <U+1148F, U+D805, U+114BA>.

(The problem seems to arise because I use a sequence of two key strokes
to enter candrabindu, and the application or input mechanism has to undo
the entry of a supplementary character entered in response to the first
keystroke.  I've reported the problem as Bug 94753.)
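
To make the failure concrete, here is a small Python sketch of how a stray high surrogate survives UTF-16 decoding. The code unit values are an assumption: they are the UTF-16 form of U+1148F and U+114BA with one extra D805 left between them, which is what the description above implies. `decode_utf16_units` is a hypothetical helper, deliberately lenient so that unpaired surrogates pass through instead of raising an error.

```python
def decode_utf16_units(units):
    """Decode 16-bit code units to code points, passing lone surrogates through."""
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        is_high = 0xD800 <= u <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            # Well-formed surrogate pair: combine into a supplementary code point.
            out.append(0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00))
            i += 2
        else:
            # BMP code unit or unpaired surrogate: keep the value as it is.
            out.append(u)
            i += 1
    return out

# Assumed code units: pair, leftover high surrogate, pair.
units = [0xD805, 0xDC8F, 0xD805, 0xD805, 0xDCBA]
code_points = decode_utf16_units(units)
```

The middle D805 is followed by another high surrogate rather than a low one, so it cannot pair up and is left standing alone between the two supplementary characters.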

Because the lone surrogate is interpreted as the start of a
user-perceived character, I can move the cursor to between U+1148F and
U+D805.  Then pressing the 'delete' key (as opposed to the 'rubout'
key) will delete the U+D805.  However, if the lone surrogate plus
combining mark is a user-perceived character, then all I will be left
with is <U+1148F>.  At present the offending application is treating
Tirhuta combining marks as user-perceived characters, but I suspect the
application has simply not caught up with Unicode Version 7 yet.
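
The two outcomes of pressing 'delete' can be sketched as follows. This is a simplified model, not any real editor's logic: a cluster is taken to be one code point plus any following code points whose General_Category is a mark, and `forward_delete` is a hypothetical helper. It assumes a Python whose `unicodedata` tables cover Unicode 7.0 or later, so that U+114BA is known to be a mark.

```python
import unicodedata

def cluster_end(cps, start):
    """Index just past the cluster at `start`: one code point plus trailing marks."""
    end = start + 1
    while end < len(cps) and unicodedata.category(chr(cps[end])).startswith("M"):
        end += 1
    return end

def forward_delete(cps, pos, whole_cluster):
    """Delete forward from `pos`, either one code point or one whole cluster."""
    end = cluster_end(cps, pos) if whole_cluster else pos + 1
    return cps[:pos] + cps[end:]

text = [0x1148F, 0xD805, 0x114BA]
by_code_point = forward_delete(text, 1, whole_cluster=False)  # mark survives
by_cluster = forward_delete(text, 1, whole_cluster=True)      # mark is lost too
```

With the cursor between U+1148F and U+D805, code point deletion removes only the surrogate, while cluster deletion also takes the combining mark with it.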

Richard.


Re: Deleting Lone Surrogates

2015-10-04 Thread Mark Davis ☕️
When I use http://unicode.org/cldr/utility/breaks.jsp, it does show the
sequence ᒏ�ᒺ as just two grapheme clusters.

In #29 we are specifically not concerned about ill-formed text (or other
degenerate cases). I suppose it would be possible to handle isolated
surrogates in different way (eg always breaking) if it represented a common
problem, but someone would have to make a very good case for that.


Mark 

*— Il meglio è l’inimico del bene —*

On Sun, Oct 4, 2015 at 3:02 PM, Richard Wordingham <
richard.wording...@ntlworld.com> wrote:

> In the absence of a specific tailoring, is the combination of a lone
> surrogate and a combining mark a user-perceived character?  Does a lone
> surrogate constitute a user-perceived character?
>
> The problem I have is that because of an application-specific bug,
> when I attempt to enter the sequence <U+1148F, U+114BA TIRHUTA SIGN
> CANDRABINDU>, I appear to be getting the UTF-16 code unit sequence
> <D805 DC8F D805 D805 DCBA>, which is being interpreted as the codepoint
> sequence <U+1148F, U+D805, U+114BA>.
>
> (The problem seems to arise because I use a sequence of two key strokes
> to enter candrabindu, and the application or input mechanism has to undo
> the entry of a supplementary character entered in response to the first
> keystroke.  I've reported the problem as Bug 94753.)
>
> Because the lone surrogate is interpreted as the start of a
> user-perceived character, I can move the cursor to between U+1148F and
> U+D805.  Then pressing the 'delete' key (as opposed to the 'rubout'
> key) will delete the U+D805.  However, if the lone surrogate plus
> combining mark is a user-perceived character, then all I will be left
> with is <U+1148F>.  At present the offending application is treating
> Tirhuta combining marks as user-perceived characters, but I suspect the
> application has simply not caught up with Unicode Version 7 yet.
>
> Richard.
>


Re: Deleting Lone Surrogates

2015-10-04 Thread Philippe Verdy
IMHO, isolated surrogates are not valid starters for combining sequences;
they must remain isolated: deleting this surrogate in your text editor
should not delete the following combining mark, which is a separate cluster
(even if that cluster is defective before the deletion, as it has NO base
starter).
For default grapheme clusters, it would be helpful to add a rule to force a
cluster break before and after any lone surrogate (i.e. for grapheme cluster
breaking, treat any lone surrogate as if it were a control like NUL U+0000).
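
The effect of such a rule can be sketched in Python. This is a deliberately simplified clustering (a mark attaches to whatever precedes it), not the full UAX #29 algorithm, and the `isolate_surrogates` switch and helper names are hypothetical. It assumes `unicodedata` tables from Unicode 7.0 or later.

```python
import unicodedata

def is_surrogate(cp):
    return 0xD800 <= cp <= 0xDFFF

def is_mark(cp):
    return unicodedata.category(chr(cp)).startswith("M")

def clusters(cps, isolate_surrogates):
    """Group code points into clusters; optionally keep lone surrogates isolated."""
    result = []
    for cp in cps:
        prev = result[-1] if result else None
        # With the extra rule, nothing may attach to a preceding surrogate.
        blocked = (isolate_surrogates and prev is not None
                   and is_surrogate(prev[-1]))
        if prev is not None and is_mark(cp) and not blocked:
            prev.append(cp)
        else:
            result.append([cp])
    return result

text = [0x1148F, 0xD805, 0x114BA]
default_clusters = clusters(text, isolate_surrogates=False)
isolated_clusters = clusters(text, isolate_surrogates=True)
```

Without the rule the sample text forms two clusters, with the mark attached to the surrogate; with it, there are three, and the mark becomes a defective sequence of its own.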

2015-10-04 15:02 GMT+02:00 Richard Wordingham <
richard.wording...@ntlworld.com>:

> In the absence of a specific tailoring, is the combination of a lone
> surrogate and a combining mark a user-perceived character?  Does a lone
> surrogate constitute a user-perceived character?
>
> The problem I have is that because of an application-specific bug,
> when I attempt to enter the sequence <U+1148F, U+114BA TIRHUTA SIGN
> CANDRABINDU>, I appear to be getting the UTF-16 code unit sequence
> <D805 DC8F D805 D805 DCBA>, which is being interpreted as the codepoint
> sequence <U+1148F, U+D805, U+114BA>.
>
> (The problem seems to arise because I use a sequence of two key strokes
> to enter candrabindu, and the application or input mechanism has to undo
> the entry of a supplementary character entered in response to the first
> keystroke.  I've reported the problem as Bug 94753.)
>
> Because the lone surrogate is interpreted as the start of a
> user-perceived character, I can move the cursor to between U+1148F and
> U+D805.  Then pressing the 'delete' key (as opposed to the 'rubout'
> key) will delete the U+D805.  However, if the lone surrogate plus
> combining mark is a user-perceived character, then all I will be left
> with is <U+1148F>.  At present the offending application is treating
> Tirhuta combining marks as user-perceived characters, but I suspect the
> application has simply not caught up with Unicode Version 7 yet.
>
> Richard.
>


Re: Deleting Lone Surrogates

2015-10-04 Thread Markus Scherer
I would not spend any time specifying intricate rules for unpaired
surrogates in 16-bit strings, or out-of range values in 32-bit strings.
Most processing will treat them like unassigned characters, like U+50005,
with only default behaviors.
markus


Re: Deleting Lone Surrogates

2015-10-04 Thread Asmus Freytag (t)

  
  
On 10/4/2015 6:02 AM, Richard Wordingham wrote:

> In the absence of a specific tailoring, is the combination of a lone
> surrogate and a combining mark a user-perceived character?  Does a lone
> surrogate constitute a user-perceived character?


In an editing interface, a lone surrogate
  should be a user perceived character, as otherwise you won't be
  able to manually delete it. Markus suggests that it be treated
  like an unassigned code point.
  
  Now, if you follow an unassigned code point with a combining mark,
  what should you get?
  
  For scripts where combining marks are productive, it seems
  counter-productive (pardon the pun) to go and limit this process,
  only to have to update your software every year as a new version
  of Unicode comes out.
  
  (Astute readers will notice that combining marks don't necessarily
  have scripts, nor do unassigned code points, so I'm talking about
  those marks that are used productively with certain scripts and
  particularly those that can be applied widely out of context for
  technical purposes)
  
  So, if you allow a generalized algorithm that gloms these marks
  onto any base, even unassigned code points, then it would be
  natural to have this happen to lone surrogates as well, meaning
  that the surrogate cannot be fixed in isolation.  That's tough.
  There are plenty of interfaces where you can't change a base
  character in isolation.
  
  If you have a bug that doesn't let you enter a sequence without
  creating a lone surrogate followed by a combining mark, that's a
  bug...
  
  A./

  



Re: Deleting Lone Surrogates

2015-10-04 Thread Richard Wordingham
On Sun, 4 Oct 2015 10:50:43 -0700
Markus Scherer wrote:

> I would not spend any time specifying intricate rules for unpaired
> surrogates in 16-bit strings, or out-of range values in 32-bit
> strings. Most processing will treat them like unassigned characters,
> like U+50005, with only default behaviors.

The core problem here is that many editors will not allow one to delete
just a non-initial character from a grapheme cluster.  I fear there may
be editors that don't even allow one to delete the final character.
This may not be a problem when one works with a small set of grapheme
clusters, as in French or German, or possibly even Vietnamese, but
becomes a problem when working with such a large set that the notion of
them being user-perceived characters strains credulity.

A stray U+50005 before a combining mark would also be fiddly to get
rid of, but even if the editor does not allow the entry of arbitrary
scalar values, a user might fix the problem by creating an HTML file
containing the character and then copying the character from the HTML
file to a find and replace command.  This trick is unlikely to work for
a lone surrogate.
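
A short Python sketch shows why the HTML route is likely to fail for a surrogate. It uses only the standard library; `utf8_or_none` is a hypothetical helper. The behaviour shown is that of Python's codecs and `html.unescape`, which is assumed here to be representative of what a browser-based workaround would run into.

```python
import html

def utf8_or_none(cp):
    """Encode one code point as UTF-8, or return None if it cannot be encoded."""
    try:
        return chr(cp).encode("utf-8")
    except UnicodeEncodeError:
        return None

# An unassigned scalar value is still encodable in a UTF-8 file...
stray_bytes = utf8_or_none(0x50005)
# ...but a lone surrogate is not a scalar value and has no UTF-8 form.
surrogate_bytes = utf8_or_none(0xD805)

# Numeric character references behave the same way.
stray_from_ref = html.unescape("&#x50005;")
surrogate_from_ref = html.unescape("&#xD805;")  # replaced by U+FFFD
```

So the character that would need to be pasted into the find and replace dialog never survives the trip through the file.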

Richard.


Re: Deleting Lone Surrogates

2015-10-04 Thread Asmus Freytag (t)

  
  
On 10/4/2015 12:38 PM, Richard Wordingham wrote:

> On Sun, 4 Oct 2015 10:50:43 -0700
> Markus Scherer wrote:
>
> > I would not spend any time specifying intricate rules for unpaired
> > surrogates in 16-bit strings, or out-of range values in 32-bit
> > strings. Most processing will treat them like unassigned characters,
> > like U+50005, with only default behaviors.
>
> The core problem here is that many editors will not allow one to delete
> just a non-initial character from a grapheme cluster.  I fear there may
> be editors that don't even allow one to delete the final character.
> This may not be a problem when one works with a small set of grapheme
> clusters, as in French or German, or possibly even Vietnamese, but
> becomes a problem when working with such a large set that the notion of
> them being user-perceived characters strains credulity.


The problem you are trying to solve is to allow editing on the code
point level, or, if you will, the keystroke level. Generally, there
will be a sweet spot for each language (and each user) with respect
to what to erase or undo. 

For sequences that belong to a given language, you can pick the
behavior that makes most sense in them, but for lone surrogates, by
definition you are dealing with broken text that doesn't follow any
conventions.

It should also be something that doesn't occur commonly. So, for all
of those reasons, I see no particular problem with giving that a
"generic" behavior, which could be that of deleting the entire
combining sequence; especially if your interface normally deletes
sequences as a unit.

If it never treats sequences as units, then I would in fact question
why this should be different for surrogates.

But in any case, the minimal requirement on an editor is that it
lets you delete (and then retype) enough text to get it back to an
uncorrupted state.

  

> A stray U+50005 before a combining mark would also be fiddly to get
> rid of, but even if the editor does not allow the entry of arbitrary
> scalar values, a user might fix the problem by creating an HTML file
> containing the character and then copying the character from the HTML
> file to a find and replace command.  This trick is unlikely to work for
> a lone surrogate.


Catch-22 here. In filtering input to the dialog to prevent it from
being used to corrupt text, you prevent it from being used to repair
text. Interesting.

A./
  



Re: Deleting Lone Surrogates

2015-10-04 Thread Richard Wordingham
On Sun, 4 Oct 2015 21:48:12 +0200
Philippe Verdy wrote:

> 2015-10-04 21:30 GMT+02:00 Richard Wordingham <
> richard.wording...@ntlworld.com>:

> > On Sun, 4 Oct 2015 15:44:32 +0200
> > Mark Davis ☕️ wrote:

> > > When I use http://unicode.org/cldr/utility/breaks.jsp, it does
> > > show the sequence ᒏ�ᒺ as just two grapheme clusters.

> > But that's the sequence <U+148F, U+FFFD, U+14BA>, which has no
> > lone surrogates at all!

> Mark just said that it was what was shown, i.e. the lone surrogate got
> treated as U+FFFD.

That's not what the English says, and I'm surprised if that's what a
literal translation into French means.  I do half suspect that he
actually tried to post a lone surrogate.

> However my opinion is that   ᒏ�ᒺ (using U+FFFD substitution) gives 2
> grapheme clusters, I would prefer a solution that gives 3 grapheme
> clusters, as if the lone surrogate was a line-break control, so that
> the third character (combining, but just after the lone surrogate)
> will not combine with it but will be handled as a defective combining
> sequence with no starter at all before it.

I'd much prefer to be able to delete the first character of a grapheme
cluster.  It's annoying to have to retype 4 characters because one's
mistyped the first of the 4 characters in a grapheme cluster.  Removing
the restriction would be much more useful.

Richard.



Re: Acquiring DIS 10646

2015-10-04 Thread Asmus Freytag (t)

  
  
On 10/4/2015 5:30 AM, Sean Leonard wrote:

On 10/3/2015 12:28 PM, Asmus Freytag (t) wrote:
  
  On 10/3/2015 8:15 AM, Sean Leonard wrote:

Thanks.
  
  
  Well, "DIS 10646" is the Draft International Standard,
  particularly Draft 1, from ~1990 or ~1991. (Sometimes it might
  have been called 10646.1.) Therefore it would likely only be
  in print form (or printed and scanned form). It's pretty old.
  What I understand is that Draft 1 got shot down because it was
  at variance with the nascent Unicode effort; Draft 2 was
  eventually adopted as ISO 10646:1993, and is equivalent to
  Unicode 1.1. (10646-1:1993 plus Amendments 5 to 7 = Unicode
  2.0.)
  


Sean,


you never explained your specific interest in this matter.
Personal curiosity? An attempt to write the definitive history of
character encoding?

  
  
  A long time ago, in a galaxy far, far away
  
  
  (Okay it really was not that long ago, and it was pretty close at
  hand since it was on this list)
  


The following doesn't really answer my question; the first draft of
10646 seems pretty irrelevant in that context.
However, I do have a small comment on your current project, so I'll
append it here:

  
  I proposed adding C1 Control Pictures to Unicode.
  
  I am resurrecting that effort, but more slowly this time, with
  more research and input from implementers. The requirement is that
  all glyphs for U+0000 - U+00FF be graphically distinct.
  
  
  Debuggers used to do this by referencing the graphemes in the
  hardware code page, such as Code Page 437, but we have come a long
  way from 1981, so displaying ♣ for 0x05 does not make much modern
  sense. Merely substituting one of the other legacy code pages in
  for 0x80 - 0x9F does not make sense either. The characters of Code
  Page 437 overlap with U+00A0 - U+00FF in that range, for example.
  (Windows-1252 is somewhat more defensible, but Windows-1252 has 5
  unassigned code points so it would be incomplete.)
  


Totally agree that mapping these to random glyphs from 8-bit sets
that happen to have those positions mapped to printable shapes is
not useful.

But this problem is already solved. Implementers already have
solutions, and they do not depend on encoding anything or making any
other changes. They simply show shapes that somehow contain the
abbreviation for the control code, as in this example showing the
line endings from a random text file:



You can see that the shapes do not actually resemble the existing
control pictures' glyph design although the principle is clearly
related. Also notice that the implementation chooses to use
different techniques for showing whitespace.

Since over the 25 years of the standard none of these implementers
ever approached the consortium with a request for a standard set of
character codes, my conclusion would be that this is a solution in
search of a problem.

The case for most of the original control pictures was very marginal
and grounded, if I recall, in specific legacy implementations of
dumb terminals.

Unlike the old host-terminal interfaces, modern debuggers don't send
character streams to a dumb device. There is a whole rendering
architecture that offers plenty of choices for substituting different
shapes for certain code points. All that is taking place on a level
where the actual 'codes' used for that are not shared or visible,
reducing the benefit of standardization. As you can see from my
example, the other benefit of standardization, which is a consensus
around a specific set of shapes, as you would have, for example for
standard math symbols, is also absent, because implementers like to
use different techniques - in my example colored dots, arrows and
the like for whitespace and fat heavy black rounded rectangles with
abbreviations for other control codes. And who knows, the formatting
could change -- many debuggers now let you view text data in
different modes, for example.

A./
  



Re: Deleting Lone Surrogates

2015-10-04 Thread Richard Wordingham
On Sun, 4 Oct 2015 15:44:32 +0200
Mark Davis ☕️ wrote:

> When I use http://unicode.org/cldr/utility/breaks.jsp, it does show
> the sequence ᒏ�ᒺ as just two grapheme clusters.

But that's the sequence <U+148F, U+FFFD, U+14BA>, which has no lone
surrogates at all!  (I had to look at the raw email file to be sure of
what the text was - my email client displays U+FFFD and malformed
alleged UTF-8 the same.)  I believe I would have a good chance of
repairing that by replacing U+FFFD by nothing.

It's not even certain that the substitution to replace U+FFFD would
work. With a more fully supported script in LibreOffice, I would have to
switch 'CTL diacritic' matching off and hope that substitution replaced
the shortest match.  That currently works for replacing one Thai
consonant by another.  To systematically replace a non-spacing Thai
character by another, I have to resort to 'regular expression'
search and replace.  I must hope that they never choose to interpret
the search as matching extended grapheme clusters.
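
For comparison, here is what code-point-level search and replace looks like in Python, whose `str.replace` and `re` module match individual code points rather than grapheme clusters. The sample strings are assumptions chosen for illustration: the first is the three-character sequence as it arrived in Mark's message, and the second is a Thai syllable in which MAI EK is swapped for MAI THO.

```python
import re

# The sequence as received: U+FFFD where the lone surrogate had been.
damaged = "\u148f\ufffd\u14ba"
repaired = damaged.replace("\ufffd", "")

# KO KAI + MAI EK + SARA AA: replace only the non-spacing tone mark.
word = "\u0e01\u0e48\u0e32"
retoned = re.sub("\u0e48", "\u0e49", word)
```

Both replacements succeed because nothing here treats the base letter and its mark as an indivisible unit; a search defined over extended grapheme clusters would not find the bare tone mark.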

Do all Unicode character properties extend to all codepoints?  If not,
how does one tell which do and which don't?  If the Unicode
segmentation algorithms do apply to sequences of codepoints, as
opposed to merely to Unicode strings, then indeed <U+D805, U+114BA> is
a legacy grapheme cluster.  It's an extremely unhelpful one!

> In #29 we are specifically not concerned about ill-formed text (or
> other degenerate cases). I suppose it would be possible to handle
> isolated surrogates in different way (eg always breaking) if it
> represented a common problem, but someone would have to make a very
> good case for that.

I suppose the argument will go that by using rare scripts or obsolete
characters, one deserves all the problems that one gets.  The only
widely used script where one is likely to encounter lone surrogates is
CJK, and they are less of a problem there.  Ideally, one shouldn't get
isolated surrogates, but when one does, the mechanisms intended to
prevent them occurring can make dealing with them difficult.

Richard.



Re: Deleting Lone Surrogates

2015-10-04 Thread Philippe Verdy
2015-10-04 21:30 GMT+02:00 Richard Wordingham <
richard.wording...@ntlworld.com>:

> On Sun, 4 Oct 2015 15:44:32 +0200
> Mark Davis ☕️ wrote:
>
> > When I use http://unicode.org/cldr/utility/breaks.jsp, it does show
> > the sequence ᒏ�ᒺ as just two grapheme clusters.
>
> But that's the sequence <U+148F, U+FFFD, U+14BA>, which has no lone
> surrogates at all!  (I had to look at the raw email file to be sure of
> what the text was - my email client displays U+FFFD and malformed
> alleged UTF-8 the same.)

Mark just said that it was what was shown, i.e. the lone surrogate got
treated as U+FFFD.
However my opinion is that   ᒏ�ᒺ (using U+FFFD substitution) gives 2
grapheme clusters, I would prefer a solution that gives 3 grapheme
clusters, as if the lone surrogate was a line-break control, so that the
third character (combining, but just after the lone surrogate) will not
combine with it but will be handled as a defective combining sequence with
no starter at all before it.


Re: Deleting Lone Surrogates

2015-10-04 Thread Richard Wordingham
On Sun, 4 Oct 2015 12:30:23 -0700
"Asmus Freytag (t)"  wrote:

> If you have a bug that doesn't let you enter a sequence without
> creating a lone surrogate followed by a combining mark, that's a
> bug...

Unfortunately, the bug appears to be in an ill-defined interface in
which I have observed regression even within the BMP.  We've discussed
the ambiguity of 'delete one character' in the context of normalisation
before on this list, and the surest solution seemed to be for the
application to surrender some control of its 'backing store' to the
input method.

It's conceivable that the input methods that are compatible for the BMP
are incompatible in the supplementary planes. For now, I'm going to
have to either work round the problem by using dead keys instead or be
thankful that the application hasn't caught up with Unicode 7.0.

Richard.


Re: Deleting Lone Surrogates

2015-10-04 Thread Philippe Verdy
The default behavior for unassigned characters is to treat them like base
characters, so if they are followed by a combining mark, it would create a
default grapheme cluster, which is not appropriate here.

Surrogates are not characters (so they cannot have any character
properties), but they are assigned and so don't have "default" properties
(only meant for *unassigned* codepoints).

I still think that it is safer to treat them (for text segmentation
purposes) as pure isolates, i.e. exactly like basic controls such as U+0000
NUL, or like the U+FFFD replacement character, which is typically used as a
visible placeholder for various errors.

For normalisation purposes they should also have combining class 0 (i.e.
acting as blockers against reorderings for canonical equivalences), and not
be treated as "transparent" (discarded and bypassed as if those surrogates
were not present at all).
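
As a point of reference, this is what Python's `unicodedata` module reports for the code points discussed here, together with a small normalisation example showing the "blocker" behaviour. It is only a sketch of one implementation's data, not a statement about what the standard requires; `props` is a hypothetical helper.

```python
import unicodedata

def props(cp):
    """Return (General_Category, canonical combining class) for a code point."""
    ch = chr(cp)
    return unicodedata.category(ch), unicodedata.combining(ch)

# NUL, a lone surrogate, the replacement character, an unassigned code point.
table = {hex(cp): props(cp) for cp in (0x0000, 0xD805, 0xFFFD, 0x50005)}

# Acute (ccc 230) followed by dot below (ccc 220) is reordered by NFD...
reordered = unicodedata.normalize("NFD", "a\u0301\u0323")
# ...unless a code point with combining class 0 sits between the two marks.
blocked = unicodedata.normalize("NFD", "a\u0301\ud805\u0323")
```

In this implementation the surrogate is reported with category Cs and combining class 0, so it already acts as a blocker and the two marks around it keep their original order.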

2015-10-04 19:50 GMT+02:00 Markus Scherer:

> I would not spend any time specifying intricate rules for unpaired
> surrogates in 16-bit strings, or out-of range values in 32-bit strings.
> Most processing will treat them like unassigned characters, like U+50005,
> with only default behaviors.
> markus
>


Re: Deleting Lone Surrogates

2015-10-04 Thread Asmus Freytag (t)

  
  
On 10/4/2015 2:35 PM, Richard Wordingham wrote:

> > However my opinion is that ᒏ�ᒺ (using U+FFFD substitution) gives 2
> > grapheme clusters, I would prefer a solution that gives 3 grapheme
> > clusters, as if the lone surrogate was a line-break control, so that
> > the third character (combining, but just after the lone surrogate)
> > will not combine with it but will be handled as a defective combining
> > sequence with no starter at all before it.
>
> I'd much prefer to be able to delete the first character of a grapheme
> cluster.  It's annoying to have to retype 4 characters because one's
> mistyped the first of the 4 characters in a grapheme cluster.  Removing
> the restriction would be much more useful.


That makes sense for common typos, less so for uncommon (hopefully)
  data corruption.
  
  For some languages, you'll be typing several keystrokes, even if
  it's a single code point; there seems to be limited desire to
  allow you to "edit" the keystrokes. For other languages I would
  expect a UI design to cater to what local custom prefers. 
  
  A./

  



Re: Deleting Lone Surrogates

2015-10-04 Thread Asmus Freytag (t)

  
  
On 10/4/2015 4:14 PM, Richard Wordingham wrote:

> > respect to what to erase or undo.
> >
> > For sequences that belong to a given language, you can pick the
> > behavior that makes most sense in them, but for lone surrogates, by
> > definition you are dealing with broken text that doesn't follow any
> > conventions.
>
> Who's 'you'?  Customisation is frequently not available.  In fact, I
> don't recall seeing it on offer.

The UI developer. 

And there's nothing Unicode can do about lack of customizability.

A./

  



Re: Deleting Lone Surrogates

2015-10-04 Thread Richard Wordingham
On Sun, 4 Oct 2015 16:57:15 -0700
"Asmus Freytag (t)"  wrote:

> On 10/4/2015 4:14 PM, Richard Wordingham wrote:
> respect to what to erase or undo.

>>> For sequences that belong to a given language, you can pick the
>>> behavior that makes most sense in them, but for lone surrogates, by
>>> definition you are dealing with broken text that doesn't follow any
>>> conventions.
 
>> Who's 'you'?  Customisation is frequently not available.  In fact, I
>> don't recall seeing it on offer.

> The UI developer.
 
> And there's nothing Unicode can do about lack of customizability.

Actually, there is.  I believe suggestions and recommendations in the
technical reports are quite influential.

Richard. 


Re: Deleting Lone Surrogates

2015-10-04 Thread Richard Wordingham
On Sun, 4 Oct 2015 15:34:13 -0700
"Asmus Freytag (t)"  wrote:

> On 10/4/2015 2:35 PM, Richard Wordingham wrote:

>> I'd much prefer to be able to delete the first character of a
>> grapheme
>> cluster.  It's annoying to have to retype 4 characters because one's
>> mistyped the first of the 4 characters in a grapheme cluster.
>> Removing the restriction would be much more useful.

> That makes sense for common typos, less so, for uncommon (hopefully)
> data corruption.

Allowing access within the cluster is generally useful.  Providing more
access just makes it easier to repair things.  One problem is that
there isn't a 'suspend shaping' option to allow one to see what one is
doing.  This matters when canonical combining classes are not available
to sort out the ordering of components.

> For some languages, you'll be typing several keystrokes, even if it's
> a single code point; there seems to be limited desire to allow you to
> "edit" the keystrokes.

The creators of the application do not know how many keystrokes were
used.  A multi-platform application is not likely to take note of what
keys were pressed even when this information is available.

> For other languages I would expect a UI design
> to cater to what local custom prefers.

Local custom?

'Local custom' is usually one of the following:

a) pen and ink, possibly with scraper.

b) typewriter and tippex

c) Hacked ASCII (and similar)

Only with complex ligatures would you not have access to each
character.

The only parallels to what happens now that I can think of that might
count as 'custom' are:

1) European 8-bit codes, where letter plus diacritic is treated as a
unit.

2) Korean, where one couldn't chop and change the individual jamo.

3) Thai, where a tone mark can severely restrict what scraping can do.

A UI design might respond to loud enough howls of user protest.  You
may recall Thai howls of protest when the ability to independently
delete preposed vowels was lost.  Thai may have some complex vowel
symbols, but as far as the grapheme clusters go, *Thai* doesn't get more
complicated than CVT (consonant, vowel (just one!) and tone).  Some of
the minority languages in the Thai script might be a bit more
complicated.

I do recall SIL's split cursor, which attempted to address the
difficulties of navigating through a stack of diacritics.  I miss it,
even though I never got to grips with all its subtleties.

What I believe is much more the case is that Unicode encourages 'one
size fits all'.  There are massive *translation* efforts for user
interfaces.  As to other parts of the text input/output, they are
usually separate from the applications.  The keyboard is almost totally
independent of the application.  Fonts are restricted to attempts to
provide adequate coverage, but the ideal is that the user provides his
own.  I think the LibreOffice search and replace interface says a lot.
It has visible support for Japanese - they holler and may well add
their own support into the core project - and there are some CTL
options which make best sense from the point of view of the Arabic
script.  The limitations on editing are one of the few places where the
UI is under the tight control of the programmers.  By and large, they
seem to be influenced by a few sources, such as the Unicode technical
reports.

Refutation awaited.

Now an attitude of 'one size fits all' does get things done.  It might
be a bit rough, but it's a lot better than nothing.

Richard.


Re: NNBSP and Word Boundaries

2015-10-04 Thread Richard Wordingham
On Fri, 2 Oct 2015 09:25:01 +0200
Mark Davis ☕️ wrote:

> We add:
> 
> WB13c Mongolian_Letter × NNBSP
> WB13d NNBSP × Mongolian_Letter
> 
> *If* we want to also change behavior on the other side of the NNBSP,
> whenever the Mongolian_Letter and NNBSP occur in sequence, we add 2
> additional rules (with the appropriate values for ..., like Numeric)
> 
> WB13c Mongolian_Letter NNBSP × (...)
> WB13d (...) × NNBSP Mongolian_Letter

I'll assume the last two are meant to be WB13e and WB13f.

We can achieve the effects down to the first WB13d simply by changing
NNBSP from XX to MidNumLet.  This would also provide a proper "espace
fine" for French use within numbers
( https://www.druide.com/enquetes/pour-des-espaces-ins%C3%A9cables-impeccables
) to separate groups of 3 digits.  This needs *no* extra rules.

Now for combined numbers and letters, we might consider adding the two
rules:

WB12a Numeric MidNumLet × AHLetter
WB12b Numeric × MidNumLet AHLetter

I think we should go the whole hog, and instead have

WB12c (Numeric|AHLetter) MidNumLetQ × (Numeric|AHLetter)
WB12d (Numeric|AHLetter) × MidNumLetQ (Numeric|AHLetter)

Perhaps there are good reasons against them - I'm not aware of any.  (I
don't think it is wrong to treat "no.2" as a single word.)  These rules
would make the abbreviated names of a good many Thai forms (e.g. คร.๒, a
marriage certificate) into a single word.

WB12c and WB12d overlap with WB6, WB7, WB11 and WB12, which could be
slightly simplified. 
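
The effect of these changes can be sketched with a toy word breaker in Python. This is a heavily reduced model of the UAX #29 word rules, not an implementation of them: only three classes are recognised, MidLetter and MidNum are folded into a single MidNumLetQ class, and the switches `nnbsp_is_mid` and `mixed` (standing for the reclassification of NNBSP and for the WB12c/WB12d generalisation) are hypothetical names.

```python
NNBSP = "\u202f"
CORE = ("Numeric", "AHLetter")

def word_class(ch, nnbsp_is_mid):
    """Very rough stand-in for the Word_Break property."""
    if ch.isdigit():
        return "Numeric"
    if ch.isalpha():
        return "AHLetter"
    if ch in (".", "'") or (nnbsp_is_mid and ch == NNBSP):
        return "MidNumLetQ"
    return "Other"

def split_words(text, nnbsp_is_mid=False, mixed=False):
    cls = [word_class(c, nnbsp_is_mid) for c in text]

    def bridged(left, right):
        # Current rules need the same class on both sides of the Mid;
        # the WB12c/WB12d idea allows any letter/number combination.
        return left in CORE and right in CORE and (mixed or left == right)

    words, start = [], 0
    for i in range(1, len(text)):
        prev, cur = cls[i - 1], cls[i]
        nxt = cls[i + 1] if i + 1 < len(text) else None
        prev2 = cls[i - 2] if i >= 2 else None
        no_break = (
            (prev in CORE and cur in CORE)                   # WB5, WB8-WB10
            or (cur == "MidNumLetQ" and bridged(prev, nxt))   # WB6, WB12, WB12d
            or (prev == "MidNumLetQ" and bridged(prev2, cur)) # WB7, WB11, WB12c
        )
        if not no_break:
            words.append(text[start:i])
            start = i
    words.append(text[start:])
    return words

number = "1" + NNBSP + "234"
today = split_words(number)                          # NNBSP breaks the number
reclassified = split_words(number, nnbsp_is_mid=True)
abbreviation = split_words("no.2", mixed=True)
```

Under these assumptions, reclassifying NNBSP keeps a digit group such as "1 234" together without any new rule, while "no.2" only becomes a single word once the mixed letter/number rules are switched on.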

Richard.



Re: Deleting Lone Surrogates

2015-10-04 Thread Richard Wordingham
On Sun, 4 Oct 2015 14:29:16 -0700
"Asmus Freytag (t)"  wrote:

> On 10/4/2015 12:38 PM, Richard Wordingham wrote:

> The problem you are trying to solve is to allow editing on
> the code point level, or, if you will, the keystroke level.

> Generally, there will be a sweet spot for each language (and each
> user) with respect to what to erase or undo.

> For sequences that belong to a given language, you can pick the
> behavior that makes most sense in them, but for lone surrogates, by
> definition you are dealing with broken text that doesn't follow any
> conventions.

Who's 'you'?  Customisation is frequently not available.  In fact, I
don't recall seeing it on offer.

> It should also be something that doesn't occur commonly. So, for all
> of those reasons, I see no particular problem with giving that a
> "generic" behavior, which could be that of deleting the entire
> combining sequence; especially if your interface normally deletes
> sequences as a unit.

> But in any case, the minimal requirement on an editor is that it lets
> you delete (and then retype) enough text to get it back to an
> uncorrupted state.

In the problem I hit, I would nearly be left with two options - never
having CANDRABINDU and always having it preceded by a lone surrogate.
Whenever I enter CANDRABINDU, it is preceded by the lone surrogate.
Consequently, the option of retyping the sequence is of no avail.
Fortunately, in the application where I met the problem, the lone
surrogates, and nothing else, get deleted when the file is saved. The
problem could very easily be a lot worse.



> Catch-22 here. In filtering input to the dialog to prevent it from
> being used to corrupt text, you prevent it from being used to repair
> text. Interesting.

Not very different to having a very roll-stable aeroplane. If you ever
do end up upside-down, you have a big problem. 

Richard.