Re: Unicode Categories

2010-11-12 Thread karl williamson

Tom Christiansen wrote:

Patrick wrote:

: > * Almost. E.g. isL would be nice to have as well.
:
: Those exist also:
:
:  $ ./perl6
:  > say 'abCD34' ~~ / <isL> /
:  a
:  > say 'abCD34' ~~ / <isN> /
:  3
:  >

They may exist, but I'm not certain it's a good idea to encourage
the Is_XXX approach on *anything* except Script=XXX properties.  


They certainly don't work on everything, you know.

Also, I can't for the life of me see why one would ever write  when
 is so much more obvious; similarly, for  over .  
Just because you can do so, doesn't mean you necessarily should.


http://unicode.org/reports/tr18/#Categories

The recommended names for UCD properties and property values are in
PropertyAliases.txt [Prop] and PropertyValueAliases.txt [PropValue].
There are both abbreviated names and longer, more descriptive names.

It is strongly recommended that both names be recognized, and that
loose matching of property names be used, whereby the case
distinctions, whitespace, hyphens, and underbar are ignored.

Furthermore, be aware that the Number property is *NOT* the same
as the Decimal_Number property.  In perl5, if one wants [0-9], then
one expresses it exactly that way, since that's a lot shorter than
writing (?=\p{ASCII})\p{Nd}, where Nd can also be Decimal_Number.

Again, please note that Number is far broader than even Decimal_Number,
which is itself almost certainly broader than you're thinking.

Here's a trio of little programs specifically designed to help scout
out Unicode characters and their properties.  They work best on 5.12+,
but should be ok on 5.10, too.

--tom



The 'Is' prefix can be used on any property in 5.12 for which there is 
no naming conflict.  The only naming conflicts are certain of the block 
properties, such as Arabic.  IsArabic means the Arabic script.  InArabic 
means the base Arabic block.  Personally, I find Is and In unintuitive, 
and prefer to write sc=arabic or blk=arabic instead.


When Unicode proposed to add some properties in 5.2 that started with 
'Is', there was significant enough protest that they backed off, and 
promised never to do it again, adding a stability policy to 6.0 to that 
effect.  Apparently a number of languages use 'Is' as a prefix.


Re: Unicode Categories

2010-11-11 Thread Tom Christiansen
>The 'Is' prefix can be used on any property in 5.12 for which there is 
>no naming conflict.  The only naming conflicts are certain of the block 
>properties, such as Arabic.  IsArabic means the Arabic script.  InArabic 
>means the base Arabic block.  Personally, I find Is and In unintuitive, 
>and prefer to write sc=arabic or blk=arabic instead.

I agree.

>When Unicode proposed to add some properties in 5.2 that started with 
>'Is', there was significant enough protest that they backed off, and 
>promised never to do it again, adding a stability policy to 6.0 to that 
>effect.  Apparently a number of languages use 'Is' as a prefix.

Yes, that's right.  Even worse, there are languages that are very very 
bad about "Is" vs "In", giving the wrong sense to them.

--tom


Re: Unicode Categories

2010-11-10 Thread Tom Christiansen
Patrick wrote:

: > * Almost. E.g. isL would be nice to have as well.
:
: Those exist also:
:
:  $ ./perl6
:  > say 'abCD34' ~~ / <isL> /
:  a
:  > say 'abCD34' ~~ / <isN> /
:  3
:  >

They may exist, but I'm not certain it's a good idea to encourage
the Is_XXX approach on *anything* except Script=XXX properties.  

They certainly don't work on everything, you know.

Also, I can't for the life of me see why one would ever write  when
 is so much more obvious; similarly, for  over .  
Just because you can do so, doesn't mean you necessarily should.

http://unicode.org/reports/tr18/#Categories

The recommended names for UCD properties and property values are in
PropertyAliases.txt [Prop] and PropertyValueAliases.txt [PropValue].
There are both abbreviated names and longer, more descriptive names.

It is strongly recommended that both names be recognized, and that
loose matching of property names be used, whereby the case
distinctions, whitespace, hyphens, and underbar are ignored.
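
The loose matching described in that recommendation can be sketched
roughly as follows. This is a Python illustration rather than Perl, and
only an approximation: the helper names `loose_key` and `canonical` and
the tiny `ALIASES` table are hypothetical, standing in for the real data
in PropertyValueAliases.txt.

```python
import re

def loose_key(name: str) -> str:
    """Fold a property name for loose comparison.

    Whitespace, hyphens, and underscores are dropped and case is ignored,
    as the recommendation quoted above describes.
    """
    return re.sub(r"[\s_-]+", "", name).casefold()

# Hypothetical, hand-written alias table for two General_Category values.
# A real implementation would load PropertyValueAliases.txt instead.
ALIASES = {
    loose_key(alias): short
    for short, long_name in (("Nd", "Decimal_Number"), ("Lu", "Uppercase_Letter"))
    for alias in (short, long_name)
}

def canonical(name: str):
    """Return the short value name for a loosely spelled alias, or None."""
    return ALIASES.get(loose_key(name))

if __name__ == "__main__":
    for spelling in ("Decimal_Number", "decimal-number", "ND", "UPPERCASE letter"):
        print(spelling, "->", canonical(spelling))
```

With this folding, "Decimal_Number", "decimal-number", and "ND" all
resolve to the same value, which is the point of recognizing both the
abbreviated and the long names.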

Furthermore, be aware that the Number property is *NOT* the same
as the Decimal_Number property.  In perl5, if one wants [0-9], then
one expresses it exactly that way, since that's a lot shorter than
writing (?=\p{ASCII})\p{Nd}, where Nd can also be Decimal_Number.

Again, please note that Number is far broader than even Decimal_Number,
which is itself almost certainly broader than you're thinking.
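
To make the difference concrete, here is a small sketch using Python's
standard unicodedata module (not Perl). The function names are made up
for illustration, and which characters get which category depends on the
Unicode version bundled with the interpreter, though the four samples
below have had these categories for a long time.

```python
import unicodedata

def is_number(ch: str) -> bool:
    """General_Category N: any of Nd, Nl, or No."""
    return unicodedata.category(ch).startswith("N")

def is_decimal_number(ch: str) -> bool:
    """General_Category Nd (Decimal_Number) only."""
    return unicodedata.category(ch) == "Nd"

def is_ascii_digit(ch: str) -> bool:
    """The plain [0-9] range."""
    return "0" <= ch <= "9"

SAMPLES = [
    "7",       # DIGIT SEVEN (Nd)
    "\u0663",  # ARABIC-INDIC DIGIT THREE (Nd)
    "\u2167",  # ROMAN NUMERAL EIGHT (Nl)
    "\u00bd",  # VULGAR FRACTION ONE HALF (No)
]

if __name__ == "__main__":
    for ch in SAMPLES:
        print(
            f"U+{ord(ch):04X}",
            unicodedata.category(ch),
            "Number" if is_number(ch) else "-",
            "Decimal_Number" if is_decimal_number(ch) else "-",
            "[0-9]" if is_ascii_digit(ch) else "-",
        )
```

All four samples are Numbers, only the first two are Decimal_Numbers,
and only the first is in [0-9].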

Here's a trio of little programs specifically designed to help scout
out Unicode characters and their properties.  They work best on 5.12+,
but should be ok on 5.10, too.

--tom


unitrio.tar.gz
Description: application/tar


Re: Unicode Categories

2010-11-10 Thread Tom Christiansen
Patrick wrote at 12:15pm CST on Wednesday, 10 November 2010:

>> Sorry if this is the wrong forum. I was wondering if there was a way to
>> specify unicode categories in
>> a regular expression (and hence a grammar), or if there would be any
>> consideration for adding support for that (requiring some kind of special
>> syntax).

> Unicode categories are done using assertion syntax with "is" followed by
> the category name.  Thus <isLu> (uppercase letter), <isNd> (decimal digit),
> <isZs> (space separator), etc.

> This even works in Rakudo today:

>$ ./perl6
>> say 'abcdEFG' ~~ / <isLu> /
>E

> They can also be combined, as in <+isLu+isLt> (uppercase+titlecase).
> The relevant section of the spec is in Synopsis 5; search for "Unicode
> properties are always available with a prefix".
 
> Hope this helps!

Actually, that quote from Synopsis raises more questions than it answers.

Below I've annotated the three output groups with (letters):

% uniprops -a A
U+0041 ‹A› \N{ LATIN CAPITAL LETTER A }:
 (A)\w \pL \p{LC} \p{L_} \p{L&} \p{Lu}
 (B)AHex ASCII_Hex_Digit All Any Alnum Alpha Alphabetic ASCII Assigned
Cased Cased_Letter LC Changes_When_Casefolded CWCF
Changes_When_Casemapped CWCM Changes_When_Lowercased CWL
Changes_When_NFKC_Casefolded CWKCF Lu L Gr_Base Grapheme_Base Graph
GrBase Hex XDigit Hex_Digit ID_Continue IDC ID_Start IDS Letter L_
Latin Latn Uppercase_Letter PerlWord PosixAlnum PosixAlpha
PosixGraph PosixPrint PosixUpper Print Upper Uppercase Word
XID_Continue XIDC XID_Start XIDS
 (C)Age:1.1 Block=Basic_Latin Bidi_Class:L Bidi_Class=Left_To_Right
Bidi_Class:Left_To_Right Bc=L Block:ASCII Block:Basic_Latin
Blk=ASCII Canonical_Combining_Class:0
Canonical_Combining_Class=Not_Reordered
Canonical_Combining_Class:Not_Reordered Ccc=NR
Canonical_Combining_Class:NR Decomposition_Type:None Dt=None
East_Asian_Width:Na East_Asian_Width=Narrow East_Asian_Width:Narrow
Ea=Na Grapheme_Cluster_Break:Other GCB=XX Grapheme_Cluster_Break:XX
Grapheme_Cluster_Break=Other Hangul_Syllable_Type:NA
Hangul_Syllable_Type=Not_Applicable
Hangul_Syllable_Type:Not_Applicable Hst=NA
Joining_Group:No_Joining_Group Jg=NoJoiningGroup
Joining_Type:Non_Joining Jt=U Joining_Type:U
Joining_Type=Non_Joining Script=Latin Line_Break:AL
Line_Break=Alphabetic Line_Break:Alphabetic Lb=AL Numeric_Type:None
Nt=None Numeric_Value:NaN Nv=NaN Present_In:1.1 Age=1.1 In=1.1
Present_In:2.0 In=2.0 Present_In:2.1 In=2.1 Present_In:3.0 In=3.0
Present_In:3.1 In=3.1 Present_In:3.2 In=3.2 Present_In:4.0 In=4.0
Present_In:4.1 In=4.1 Present_In:5.0 In=5.0 Present_In:5.1 In=5.1
Present_In:5.2 In=5.2 Script:Latin Sc=Latn Script:Latn
Sentence_Break:UP Sentence_Break=Upper Sentence_Break:Upper SB=UP
Word_Break:ALetter WB=LE Word_Break:LE Word_Break=ALetter

What that means is that the "B" properties are properties from 
the *General* category.  They may all be referred to as \p{X} 
or \p{IsX}, \p{General_Category=X} or \p{General_Category:X}, 
and \p{GC=X} or \p{GC:X}.

I have a feeling that your synopsis quote is referring to type B
properties alone.  It is not talking about type C properties, which
must also be accounted for.

--tom


Re: Unicode Categories

2010-11-10 Thread Chase Albert
Even awesomer, thank you again.

On Wed, Nov 10, 2010 at 13:28, Patrick R. Michaud wrote:

> On Wed, Nov 10, 2010 at 01:21:57PM -0500, Chase Albert wrote:
> > That's exactly what I was looking for*. Awesome, thank you.
> >
> > * Almost. E.g. isL would be nice to have as well.
>
> Those exist also:
>
>  $ ./perl6
>  > say 'abCD34' ~~ / <isL> /
>  a
>  > say 'abCD34' ~~ / <isN> /
>  3
>  >
>
> Pm
>


Re: Unicode Categories

2010-11-10 Thread Patrick R. Michaud
On Wed, Nov 10, 2010 at 01:21:57PM -0500, Chase Albert wrote:
> That's exactly what I was looking for*. Awesome, thank you.
> 
> * Almost. E.g. isL would be nice to have as well.

Those exist also:

  $ ./perl6
  > say 'abCD34' ~~ / <isL> /
  a
  > say 'abCD34' ~~ / <isN> /
  3
  > 

Pm


Re: Unicode Categories

2010-11-10 Thread Chase Albert
That's exactly what I was looking for*. Awesome, thank you.

~Cheers


* Almost. E.g. isL would be nice to have as well.

On Wed, Nov 10, 2010 at 13:15, Patrick R. Michaud wrote:

> "Unicode
> properties are always available with a prefix"
>


Re: Unicode Categories

2010-11-10 Thread Patrick R. Michaud
On Wed, Nov 10, 2010 at 01:03:26PM -0500, Chase Albert wrote:
> Sorry if this is the wrong forum. I was wondering if there was a way to
> specify unicode categories in
> a regular expression (and hence a grammar), or if there would be any
> consideration for adding support for that (requiring some kind of special
> syntax).

Unicode categories are done using assertion syntax with "is" followed by
the category name.  Thus <isLu> (uppercase letter), <isNd> (decimal digit),
<isZs> (space separator), etc.

This even works in Rakudo today:

$ ./perl6
> say 'abcdEFG' ~~ / <isLu> /
E

They can also be combined, as in <+isLu+isLt> (uppercase+titlecase).
The relevant section of the spec is in Synopsis 5; search for "Unicode
properties are always available with a prefix".
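
For readers without Rakudo at hand, the same idea (find the first
character in one category, or in a union of categories such as Lu plus
Lt) can be approximated with Python's standard unicodedata module. This
is only an analogy, not how Rakudo implements it, and the function name
`first_in_categories` is made up for the example.

```python
import unicodedata

def first_in_categories(text: str, categories: set):
    """Return the first character whose General_Category is in `categories`.

    Passing {"Lu"} looks for an uppercase letter alone; passing
    {"Lu", "Lt"} looks for the union of uppercase and titlecase letters.
    Returns None when nothing matches.
    """
    for ch in text:
        if unicodedata.category(ch) in categories:
            return ch
    return None

if __name__ == "__main__":
    # Uppercase only: the first Lu character in the string.
    print(first_in_categories("abcdEFG", {"Lu"}))

    # Uppercase or titlecase: U+01C5 is a titlecase letter (Lt),
    # so it is found before the later uppercase D.
    print(first_in_categories("ab\u01c5cD", {"Lu", "Lt"}))
```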

Hope this helps!

Pm