Re: Perl6 and accents
Tom Christiansen wrote:

> Certainly it's perfectly well known amongst people who deal with letters--including with the Unicode standard.

> > Accent does have a colloquial meaning that maps correctly, but sadly that colloquial definition does not correspond to the technical definition, so in being clear, you become less accurate. There is, as far as I'm aware, no good middle ground, here.

> One doesn't *have* to make up play-words. There's nothing wrong with the correct terminology. Calling a mark a mark is pretty darned simple.

Well, scientists are not always happy with Unicode terms, e.g. 'ideograph' for Han characters, or 'Latin' for Roman scripts. But the terms should be used as defined by the standard--as names/identifiers of properties.

> Unicode has blocks for diacritic marks, and a Diacritic property for testing whether something is one. There are 1328 code points whose canonical decompositions have both \p{Diacritic} and \pM in them, 946 code points that have only \pM but not \p{Diacritic}, and 197 that have \p{Diacritic} but not \pM.

If someone really uses Unicode, there is no way around deep knowledge of the properties. Such code will use Unicode properties directly, and Perl 6 should therefore support all the properties.

> I still think resorting to talking about accent marks is a bad idea. I had somebody the other day thinking that throwing out the accent marks meant deleting all characters whose code points were over 0x7F--and this was a recent CompSci major, too.

I know this sort of people. They also believe that UTF-8 is a 2-byte encoding.

> But that's nothing. The more you look into it, the weirder it can get, especially with collation and canonical equivalence, both of which really require locale knowledge outside the charset itself.

Sure. The specs of Perl 6 still need huge work on the Unicode part.

Helmut Wollmersdorfer
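The points raised in this exchange (Mark versus Diacritic, stripping marks versus deleting everything above 0x7F, and the width of UTF-8) can be tried out with a short Perl 5 sketch. It is only an illustration under stated assumptions: the helper names mark_class, strip_marks, delete_non_ascii and utf8_length are invented here and appear nowhere in the thread, and the property results depend on the Unicode version bundled with the perl in use.

```perl
use strict;
use warnings;
use Encode qw(encode);
use Unicode::Normalize qw(NFD);

# Hypothetical helper: classify one character by the two properties
# compared in the thread, general category Mark (\pM) and Diacritic.
sub mark_class {
    my ($char) = @_;
    my $is_mark      = $char =~ /\pM/           ? 1 : 0;
    my $is_diacritic = $char =~ /\p{Diacritic}/ ? 1 : 0;
    return 'both'           if $is_mark && $is_diacritic;
    return 'mark only'      if $is_mark;
    return 'diacritic only' if $is_diacritic;
    return 'neither';
}

# Hypothetical helper: decompose canonically, then drop every mark.
# The result stays in NFD; base letters are kept.
sub strip_marks {
    my ($text) = @_;
    my $decomposed = NFD($text);
    $decomposed =~ s/\pM+//g;
    return $decomposed;
}

# Hypothetical helper: the misreading from the anecdote, which deletes
# every character above U+007F instead of removing marks.
sub delete_non_ascii {
    my ($text) = @_;
    $text =~ s/[^\x00-\x7F]+//g;
    return $text;
}

# Hypothetical helper: number of bytes one character takes in UTF-8.
sub utf8_length {
    my ($char) = @_;
    return length(encode('UTF-8', $char));
}

# U+0301 COMBINING ACUTE ACCENT, U+034F COMBINING GRAPHEME JOINER,
# U+005E CIRCUMFLEX ACCENT, and a plain letter.
printf "U+%04X: %s\n", $_, mark_class(chr $_) for 0x301, 0x34F, 0x5E, 0x41;

print strip_marks("caf\x{E9}"), "\n";       # marks removed, letters kept
print delete_non_ascii("caf\x{E9}"), "\n";  # the whole letter is lost

printf "U+%04X takes %d byte(s) in UTF-8\n", $_, utf8_length(chr $_)
    for 0x41, 0xE9, 0x20AC, 0x1F600;
```

The four code points in the last loop were picked so that each lands in a different UTF-8 length class, from one byte to four.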
Perl6 and accents
Exegesis 5 @ http://dev.perl.org/perl6/doc/design/exe/E05.html reads:

    # Perl 6
    / <alpha - [A-Za-z]>+ /    # All alphabetics except A-Z or a-z
                               # (i.e. the accented alphabetics)

    [Update: Would now need to be <+alpha - [A..Za..z]> to avoid ambiguity with Texas quotes, and because we want to reserve whitespace as the first character inside the angles for other uses.]

    Explicit character classes were deliberately made a little less convenient in Perl 6, because they're generally a bad idea in a Unicode world. For example, the [A-Za-z] character class in the above examples won't even match standard alphabetic Latin-1 characters like 'Ã', 'é', 'ø', let alone alphabetic characters from code-sets such as Cyrillic, Hiragana, Ogham, Cherokee, or Klingon.

First off, that "i.e. the accented alphabetics" phrasing is quite incorrect! Code like

    /[^\P{Alpha}A-Za-z]/

matches not just things like

    00C1 LATIN CAPITAL LETTER A WITH ACUTE
    00C7 LATIN CAPITAL LETTER C WITH CEDILLA
    00C8 LATIN CAPITAL LETTER E WITH GRAVE
    00E5 LATIN SMALL LETTER A WITH RING ABOVE
    00F1 LATIN SMALL LETTER N WITH TILDE

but also of course:

    00AA FEMININE ORDINAL INDICATOR
    00B5 MICRO SIGN
    00BA MASCULINE ORDINAL INDICATOR
    00C6 LATIN CAPITAL LETTER AE
    00D0 LATIN CAPITAL LETTER ETH
    00DE LATIN CAPITAL LETTER THORN
    00DF LATIN SMALL LETTER SHARP S
    00E6 LATIN SMALL LETTER AE
    00F0 LATIN SMALL LETTER ETH
    01A6 LATIN LETTER YR
    01BA LATIN SMALL LETTER EZH WITH TAIL
    01BC LATIN CAPITAL LETTER TONE FIVE
    01BF LATIN LETTER WYNN
    02C7 CARON
    0391 GREEK CAPITAL LETTER ALPHA
    0410 CYRILLIC CAPITAL LETTER A

and many, many more.

I'm also disappointed to see perl6 spreading the notion that "accent" is somehow a valid synonym for

    diacritical marking
    diacritic marking
    diacritic mark

It's not. Accent is not a synonym for any of those. Not all marks are accents, and not all accents are marks.

I believe what is meant by "accent" is NFD($char) =~ /\pM/. Fine: then say "with diacritics", not "with accents".
Also, there are many combining characters that aren't accents by any stretch of the term, such as 20E3 COMBINING ENCLOSING KEYCAP, to name just one. Only three code points have official names that include ACCENT, and even these are dubious.

Finally, I note also that people use the Alpha property too loosely. Note the caron and such above. One probably wants the LC property instead.

--tom

    use charnames ();
    use Unicode::Normalize;

    binmode(STDOUT, ':encoding(UTF-8)');   # %c prints wide characters

    for my $cp ( 1 .. 0x ) {               # upper bound missing in the archived message
        my $orig  = chr($cp);
        my $canon = NFD($orig);            # NFKD gives different results
        ## if ($orig =~ /[^\P{Alpha}A-Za-z]/) {
        if ($orig =~ /\p{LC}/ && $canon !~ /^[A-Za-z]/) {
            printf("%c %04X %s\n", $cp, $cp, charnames::viacode($cp));
        }
    }
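The distinctions drawn in this message can be checked on a handful of code points with a small Perl 5 sketch. It applies the NFD($char) =~ /\pM/ test quoted above and compares the Alpha and LC properties. The helper names has_mark, is_alphabetic and is_cased_letter are hypothetical, chosen only for this illustration, and the results assume the Unicode data shipped with a reasonably recent perl.

```perl
use strict;
use warnings;
use charnames ();
use Unicode::Normalize qw(NFD);

# Hypothetical helper: true when the canonical decomposition contains
# at least one mark, i.e. the NFD($char) =~ /\pM/ test.
sub has_mark {
    my ($char) = @_;
    return NFD($char) =~ /\pM/ ? 1 : 0;
}

# Hypothetical helper: true when the character has the Alphabetic property.
sub is_alphabetic {
    my ($char) = @_;
    return $char =~ /\p{Alpha}/ ? 1 : 0;
}

# Hypothetical helper: true when the character is a cased letter
# (general category Lu, Ll or Lt).
sub is_cased_letter {
    my ($char) = @_;
    return $char =~ /\p{LC}/ ? 1 : 0;
}

# A precomposed letter, two letters with no canonical decomposition,
# the caron from the list above, and a combining mark that is no accent.
for my $cp (0xE9, 0xF8, 0xC6, 0x2C7, 0x20E3) {
    my $char = chr $cp;
    printf "%04X %-35s mark=%d alpha=%d cased=%d\n",
        $cp, charnames::viacode($cp),
        has_mark($char), is_alphabetic($char), is_cased_letter($char);
}
```

U+00F8 and U+00C6 are included because neither has a canonical decomposition, so the mark test does not fire for them even though both fall outside A-Z and a-z.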
Re: Perl6 and accents
Tom Christiansen wrote:

> Exegesis 5 @ http://dev.perl.org/perl6/doc/design/exe/E05.html reads:

The Exegeses are historical documents, and should be treated as such. (If any volunteer is around, submitting a patch that puts HISTORICAL DOCUMENT ONLY in big red letters on these pages would be greatly appreciated.)

If you want to refer to current Perl 6 development, please look at http://perlcabal.org/syn/S05.html, and http://svn.pugscode.org/pugs/docs/Perl6/Spec/S05-regex.pod if you plan to submit patches.

(That said, most of what you wrote still applies; patches to make the wording clearer are very welcome.)

Cheers,
Moritz
Re: Perl6 and accents
On Mon, May 17, 2010 at 1:52 PM, Tom Christiansen tchr...@perl.com wrote:

> Exegesis 5 @ http://dev.perl.org/perl6/doc/design/exe/E05.html reads:
>
>     # Perl 6
>     / <alpha - [A-Za-z]>+ /    # All alphabetics except A-Z or a-z
>                                # (i.e. the accented alphabetics)
>
>     [Update: Would now need to be <+alpha - [A..Za..z]> to avoid ambiguity with Texas quotes, and because we want to reserve whitespace as the first character inside the angles for other uses.]

Why isn't that:

    /<+ alpha - [A-Za-z]>+ /

> I'm also disappointed to see perl6 spreading the notion that "accent" is somehow a valid synonym for
>
>     diacritical marking
>     diacritic marking
>     diacritic mark
>
> It's not. Accent is not a synonym for any of those. Not all marks are accents, and not all accents are marks.

I agree that it's a rather folksy way of saying "them funny letters". On the other hand, I think that was the intent. It's very hard to describe Unicode spaces in ways that the average coder (not the average person, which is a small help) will grasp immediately. "Diacritical" isn't a word that most folks know, even among programmers.

Accent does have a colloquial meaning that maps correctly, but sadly that colloquial definition does not correspond to the technical definition, so in being clear, you become less accurate. There is, as far as I'm aware, no good middle ground, here.

I think having the exegeses be more colloquial and the synopses be more technically accurate makes a fair amount of sense, though perhaps footnoting the technically inaccurate elements of the exegeses would make sense.

To the question of the exegeses being out of date: if they are out of date, why are we keeping them around? Is there value there? I understand the value in keeping the apocalypses around, but that's due to their nature as the first draft of the standard. The exegeses have no such status.
Personally, I'd rather see them updated than thrown out, but I tried writing examples just for a few elements of S29 back in the day, and found the moving target to be too painful. Maybe Perl 6 has slowed down enough that it's more practical now?

--
Aaron Sherman
Email or GTalk: a...@ajs.com
http://www.ajs.com/~ajs