Re: Perl6 and accents

2010-05-18 Thread Helmut Wollmersdorfer

Tom Christiansen wrote:

> Certainly it's perfectly well known amongst people who deal with
> letters--including with the Unicode standard.
>
>> "Accent" does have a colloquial meaning that maps correctly,
>> but sadly that colloquial definition does not correspond to
>> the technical definition, so in being clear, you become less
>> accurate. There is, as far as I'm aware, no good middle
>> ground, here.
>
> One doesn't *have* to make up play-words.  There's nothing wrong with the
> correct terminology.  Calling a mark a mark is pretty darned simple.


Well, scientists are not always happy with Unicode terms, e.g.
'ideograph' for Han characters, or 'Latin' for Roman scripts. But the
terms should be used as defined by the standard--as names/identifiers of
properties.



> Unicode has blocks for diacritic marks, and a Diacritic property for
> testing whether something is one.  There are 1328 code points whose
> canonical decompositions have both \p{Diacritic} and \pM in them,
> 946 code points that have only \pM but not \p{Diacritic}, and 197 that
> have \p{Diacritic} but not \pM.


If someone really uses Unicode, there is no way around deep knowledge of
the properties. Such code will use Unicode properties directly, and Perl
6 should therefore support all the properties.
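
[Sketch: a minimal Perl 5 illustration of testing the two properties
directly. It assumes the core Unicode::Normalize module and a perl whose
regex engine knows \p{Diacritic}. The helper name classify is
hypothetical, and the counts quoted above depend on the Unicode version,
so none are reproduced here.]

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# classify is a hypothetical helper, not part of any module.
# It returns two flags for a code point: whether its canonical
# decomposition contains a \p{Diacritic} character, and whether it
# contains a combining mark (\pM).
sub classify {
    my ($cp)  = @_;
    my $nfd   = NFD(chr $cp);
    my $dia   = $nfd =~ /\p{Diacritic}/ ? 1 : 0;
    my $mark  = $nfd =~ /\pM/           ? 1 : 0;
    return ($dia, $mark);
}

# U+00E9 decomposes to e + COMBINING ACUTE ACCENT,
# U+005E is the spacing CIRCUMFLEX ACCENT,
# U+20E3 is COMBINING ENCLOSING KEYCAP.
for my $cp (0x00E9, 0x005E, 0x20E3) {
    my ($dia, $mark) = classify($cp);
    printf "U+%04X Diacritic=%d Mark=%d\n", $cp, $dia, $mark;
}
```

The two flags are kept separate because, as the figures above show,
neither property implies the other.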


> I still think resorting to talking about accent marks is a bad idea.
> I had somebody the other day thinking that throwing out the accent marks
> meant deleting all characters whose code points were over 0x7F--and this
> was a recent CompSci major, too.


I know this sort of person. They also believe that UTF-8 is a 2-byte
encoding.
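
[Sketch: a minimal Perl 5 illustration of both misconceptions, assuming
the core Unicode::Normalize and Encode modules. The sample string and
the variable names are made up for illustration.]

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);
use Encode qw(encode);

my $word = "na\x{EF}ve caf\x{E9}";    # i with diaeresis, e with acute

# Throwing out the marks: decompose first, then delete combining marks.
# The base letters survive.
(my $stripped = NFD($word)) =~ s/\pM//g;

# Deleting every character above 0x7F removes the whole letter instead.
(my $deleted = $word) =~ s/[^\x00-\x7F]//g;

print "marks stripped:    $stripped\n";
print "non-ASCII deleted: $deleted\n";

# UTF-8 is variable-width: one to four bytes per code point.
for my $cp (0x41, 0xE9, 0x20AC, 0x1D11E) {
    my $bytes = length encode("UTF-8", chr $cp);
    printf "U+%04X takes %d byte(s) in UTF-8\n", $cp, $bytes;
}
```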



> But that's nothing.  The more you look into it, the weirder it can get,
> especially with collation and canonical equivalence, both of which really
> require locale knowledge outside the charset itself.


Sure. The specs of Perl 6 still need huge work on the Unicode part.

Helmut Wollmersdorfer


Perl6 and accents

2010-05-17 Thread Tom Christiansen
Exegesis 5 @ http://dev.perl.org/perl6/doc/design/exe/E05.html reads:

  # Perl 6
  / < <alpha> - [A-Za-z] >+ /   # All alphabetics except A-Z or a-z
                                # (i.e. the accented alphabetics)

[Update: Would now need to be <+<alpha> - [A..Za..z]> to avoid ambiguity
with "Texas quotes", and because we want to reserve whitespace as the first
character inside the angles for other uses.]

Explicit character classes were deliberately made a little less convenient
in Perl 6, because they're generally a bad idea in a Unicode world. For
example, the [A-Za-z] character class in the above examples won't even
match standard alphabetic Latin-1 characters like 'Ã', 'é', 'ø', let alone
alphabetic characters from code-sets such as Cyrillic, Hiragana, Ogham,
Cherokee, or Klingon.

First off, that "i.e. the accented alphabetics" phrasing is quite incorrect!
Code like /[^\P{Alpha}A-Za-z]/ matches not just things like

00C1 LATIN CAPITAL LETTER A WITH ACUTE
00C7 LATIN CAPITAL LETTER C WITH CEDILLA
00C8 LATIN CAPITAL LETTER E WITH GRAVE
00E5 LATIN SMALL LETTER A WITH RING ABOVE
00F1 LATIN SMALL LETTER N WITH TILDE

but also of course:

00AA FEMININE ORDINAL INDICATOR
00B5 MICRO SIGN
00BA MASCULINE ORDINAL INDICATOR
00C6 LATIN CAPITAL LETTER AE
00D0 LATIN CAPITAL LETTER ETH
00DE LATIN CAPITAL LETTER THORN
00DF LATIN SMALL LETTER SHARP S
00E6 LATIN SMALL LETTER AE
00F0 LATIN SMALL LETTER ETH
01A6 LATIN LETTER YR
01BA LATIN SMALL LETTER EZH WITH TAIL
01BC LATIN CAPITAL LETTER TONE FIVE
01BF LATIN LETTER WYNN
02C7 CARON
0391 GREEK CAPITAL LETTER ALPHA
0410 CYRILLIC CAPITAL LETTER A

and many, many more.

I'm also disappointed to see perl6 spreading the notion that "accent"
is somehow a valid synonym for

diacritical marking 
diacritic marking 
diacritic mark
diacritic 
mark

It's not.  "Accent" is not a synonym for any of those.  Not all marks are
accents, and not all accents are marks.

I believe what is meant by "accent" is NFD($char) =~ /\pM/.  Fine: then
say "with diacritics", not "with accents".
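
[Sketch: the NFD($char) =~ /\pM/ test written out as runnable Perl 5,
assuming the core Unicode::Normalize module. The helper name has_mark is
hypothetical.]

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# has_mark is a hypothetical helper, not part of any module.
# True when the canonical decomposition of $char contains a
# combining mark (general category M).
sub has_mark {
    my ($char) = @_;
    return NFD($char) =~ /\pM/ ? 1 : 0;
}

# U+00E9 (e with acute) decomposes to a base letter plus a mark.
# U+00F8 (o with stroke) and U+00E6 (ae) have no canonical
# decomposition, so the test finds no mark in them.
for my $cp (0x00E9, 0x00F8, 0x00E6, 0x0065) {
    printf "U+%04X %s\n", $cp,
        has_mark(chr $cp) ? "base + mark" : "no mark";
}
```

Note that this test says nothing about letters such as U+00F8, whose
stroke is part of the letter rather than a separate combining mark.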

Also, there are many combining characters that aren't accents by any
stretch of the term, such as 20E3 COMBINING ENCLOSING KEYCAP, to name just one.
Only three code points have official names that include "ACCENT", and even
these are dubious.

Finally, I note also that people use the Alpha property too loosely.  Note
the caron and such above.  One probably wants the LC property instead.

--tom

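use strict;
use warnings;
use charnames ();
use Unicode::Normalize;

binmode(STDOUT, ":utf8");    # the %c below prints non-ASCII characters

# Upper bound assumed: the original literal was lost in the archive;
# 0x10FFFF is the last Unicode code point.
for my $cp ( 1 .. 0x10FFFF ) {
    next if $cp >= 0xD800 && $cp <= 0xDFFF;    # skip surrogates
    my $orig  = chr($cp);
    my $canon = NFD($orig);  # NFKD gives diff results
    ## if ($orig =~ /[^\P{Alpha}A-Za-z]/) {
    # cased letters whose canonical decomposition does not start with A-Z or a-z
    if ($orig =~ /\p{LC}/ && $canon !~ /^[A-Za-z]/) {
        printf("%c %04X %s\n", $cp, $cp, charnames::viacode($cp));
    }
}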


Re: Perl6 and accents

2010-05-17 Thread Moritz Lenz
Tom Christiansen wrote:
> Exegesis 5 @ http://dev.perl.org/perl6/doc/design/exe/E05.html reads:

The Exegeses are historical documents, and should be treated as such.
(If any volunteer is around, submitting a patch that puts "HISTORICAL
DOCUMENT ONLY" in big red letters on these pages would be greatly
appreciated).

If you want to refer to current Perl 6 development, please look at
http://perlcabal.org/syn/S05.html, and
http://svn.pugscode.org/pugs/docs/Perl6/Spec/S05-regex.pod if you plan
to submit patches.

(That said, most of what you wrote still applies; patches to make the
wording clearer are very welcome).

Cheers,
Moritz


Re: Perl6 and accents

2010-05-17 Thread Aaron Sherman
On Mon, May 17, 2010 at 1:52 PM, Tom Christiansen <tchr...@perl.com> wrote:

> Exegesis 5 @ http://dev.perl.org/perl6/doc/design/exe/E05.html reads:
>
>   # Perl 6
>   / < <alpha> - [A-Za-z] >+ /   # All alphabetics except A-Z or a-z
>                                 # (i.e. the accented alphabetics)
>
> [Update: Would now need to be <+<alpha> - [A..Za..z]> to avoid ambiguity
> with "Texas quotes", and because we want to reserve whitespace as the
> first character inside the angles for other uses.]


Why isn't that:

  /<+ alpha - [A-Za-z]>+ /


> I'm also disappointed to see perl6 spreading the notion that "accent"
> is somehow a valid synonym for
>
>     diacritical marking
>     diacritic marking
>     diacritic mark
>     diacritic
>     mark
>
> It's not.  "Accent" is not a synonym for any of those.  Not all marks are
> accents, and not all accents are marks.


I agree that it's a rather folksy way of saying "them funny letters". On
the other hand, I think that was the intent. It's very hard to find ways to
describe Unicode spaces in ways that the average coder (not the average
person, which is a small help) will grasp immediately. "Diacritical" isn't a
word that most folks know, even among programmers. "Accent" does have
a colloquial meaning that maps correctly, but sadly that colloquial
definition does not correspond to the technical definition, so in being
clear, you become less accurate. There is, as far as I'm aware, no good
middle ground, here.

I think having the exegeses be more colloquial and the synopses be more
technically accurate makes a fair amount of sense, though perhaps footnoting
the technically inaccurate elements of the exegeses would make sense.

To the question of the exegeses being out of date: if they are out of date,
why are we keeping them around? Is there value there? I understand the value
in keeping the apocalypses around, but that's due to their nature as the
first draft of the standard. The exegeses have no such status.

Personally, I'd rather see them updated than thrown out, but I tried writing
examples just for a few elements of S29 back in the day, and found the
moving target to be too painful. Maybe Perl 6 has slowed down enough that
it's more practical now?

-- 
Aaron Sherman
Email or GTalk: a...@ajs.com
http://www.ajs.com/~ajs