Re: Perl6 regexes and UTS#18

2011-02-09 Thread Helmut Wollmersdorfer

Larry Wall wrote:

On Sun, Feb 06, 2011 at 08:59:51PM +0100, Helmut Wollmersdorfer wrote:
: Tom Christiansen wrote:



: > I'm also curious whether there are active plans to address the
: > tr18 requirements in perl6 regexes.  It would be a wonderful
: > feather in perl6's cap to be able to legitimately claim Level 2
: > or even Level 3 compliance, since besides perl5, only ICU right
: > now manages even Level 1, with everybody else *very* far behind.

Anyone who implements all of S05 can claim Level 2.


I will remember this day as a big milestone.


: I would like to have:
: - all Unicode features of Perl5 (UCD, charnames, Normalize, properties)
: - most of the features of ICU (e.g. transforms, localisation)
: - normalization form, locale support and tailored $features on string
:   level (_not_ lexical context)
: 
: This means that any string can be in, or can be transformed into, one
: of these forms:
: - Byte
: - NFD, NFC, NFKD, NFKC
: - NFG (Default Grapheme Clusters)
: - NFGT (Tailored Grapheme Clusters)
: 
: Tailored means that the grapheme form (NFG) needs a Language-Locale
: (e.g. German-Swiss-Spelling_1996, or German-Austrian-PhoneBookCollation).
: Without a Language-Locale, an NFG string is handled as a default NFG
: string.
: 
: Tailored also means that the user (Perl6 programmer) can tailor the
: relevant mechanisms (formatting, normalization, collation,
: properties, case folding etc.).



I think we're mostly on the same page, though I think we'll need
to thrash out how much of this info is carried in the string type,
and how much is implicit to the current language's view of strings
(where the programmer gets to choose which viewpoint to take, such as
"graphemes", "codepoints", "bytes", etc.).  We want to give access to
strings from other viewpoints to the extent that we can, but the current
viewpoint will tend to produce different default results than other
viewpoints.  By default, for instance, Perl 6 is supposed to be warped
toward NFG semantics, and wants to view all strings through that lens
unless otherwise instructed.  The "lesser" strings will tend to look
more like buffers to such a view.


I agree. NFG is the most convenient mode as a general view of strings.
And it conforms to Level 2, where Unicode guarantees stability.

Sure, anyone who wants to work at the "codepoint" level (NFC, NFD) must
be aware that "2.1 Canonical Equivalents" is not fulfilled there.
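
A minimal sketch of why plain codepoint comparison does not satisfy canonical equivalence, written in Python rather than Perl 6 and using only the standard `unicodedata` module. The helper name `canonically_equal` is made up for this illustration.

```python
import unicodedata

# Two canonically equivalent spellings of the same character.
precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT


def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings after bringing both into NFD (hypothetical helper)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


# Codepoint-by-codepoint comparison sees two different strings ...
codepoint_equal = precomposed == decomposed
# ... while comparison under a common normalization form sees one.
normalized_equal = canonically_equal(precomposed, decomposed)

print(codepoint_equal, normalized_equal)  # False True
```

A matcher working on raw codepoints would have to apply this kind of normalization itself, whereas a grapheme-level view is meant to hide the difference.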



: >   =Level 2: Extended Unicode Support
: >    RL2.1    Canonical Equivalents
: 
: IMHO not specced



To me, this is also implied by NFG semantics.


I agree. After a longish discussion with Tom, who presented dozens of
corner cases to me, we now know better that it works with all the
normalization rules, and why.




: >    RL2.4    Default Loose Matches
: >    RL2.5    Name Properties
: >    RL2.6    Wildcard Properties
: 
: IMHO not specced


Was already specced as a "smartmatch", but I've made the name matching more 
explicit.


Thx

: >   =Level 3: Tailored Unicode Support
: 
: IMHO not specced


We've had :chars conjecturing tailoring for quite a while now.  Of course,
the actual syntax is negotiable.



: It would be easier to reference the appropriate chapters of the
: Unicode standard in the specification of Perl6. This would make
: Unicode test-cases reusable. An implementation should also always
: declare which Unicode features it implements (and which it does
: not), and for which version of Unicode.



We can certainly use a lot more work on the details, but the intent
should be clear that we want Perl 6 to support world-class Unicode.


Agreed.

Tailoring, as a kind of overloading, intentionally breaks the standard
behaviour.

But an implementation of NFG must have a mechanism to assign codepoints
and provide properties of NFG codepoints at runtime. The same
mechanism can be opened up for tailoring and, not to be forgotten, for
characters in the private-use range. Of course, overloading existing
properties, like the case folding of the Turkish 'i', is not the same as
just adding NFG or private properties.
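
To make the Turkish 'i' case concrete, here is a small Python sketch of what "overloading" a default case mapping looks like. It is only an illustration under simplifying assumptions: the table `TURKISH_LOWER` and the function `lower_tailored` are invented for this example, and the override works per codepoint only, so sequences such as "I" followed by U+0307 are not handled.

```python
# Default (untailored) lowercasing as applied by Python's str.lower().
default_lower_I = "I".lower()             # "i"
default_lower_dotted = "\u0130".lower()   # "i" + U+0307 COMBINING DOT ABOVE

# Hypothetical tailoring table for Turkish (name invented for this sketch).
TURKISH_LOWER = {
    "I": "\u0131",     # LATIN CAPITAL LETTER I -> dotless i
    "\u0130": "i",     # LATIN CAPITAL LETTER I WITH DOT ABOVE -> i
}


def lower_tailored(text: str, overrides: dict) -> str:
    """Lowercase text, letting per-character overrides win over the default."""
    return "".join(overrides.get(ch, ch.lower()) for ch in text)


print(lower_tailored("ISIK", {}))             # isik
print(lower_tailored("ISIK", TURKISH_LOWER))  # ısık
```

The override replaces an existing, standardized mapping, which is the part that differs from merely attaching properties to newly assigned or private-use codepoints.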


Helmut Wollmersdorfer



Re: Perl6 regexes and UTS#18

2011-02-09 Thread Larry Wall
On Sun, Feb 06, 2011 at 08:59:51PM +0100, Helmut Wollmersdorfer wrote:
: Tom Christiansen wrote:
: > Has anybody specifically looked at how Perl6 regexes might map to
: > the various requirements of UTS#18, Unicode Regular Expressions?
: >
: > http://unicode.org/reports/tr18/
: 
: Roughly, Perl6 supports Level 2 (I did a quick check of UTS#18 and the specs).

I believe the spec now supports all of Level 2 explicitly, more or less.

: > I ask because to my inexperienced eye, quite a few perl6isms are
: > *much* better at this than in perl5 obtain, and so I wondered
: > whether this was by conscious intent and design.  Is/Was it?
: 
: Seems intended.

Well, some of it is convergent evolution (or divergent, in some cases),
but we have paid a certain amount of attention to what the unicode
folks have to say over the years.

: > I'm also curious whether there are active plans to address the
: > tr18 requirements in perl6 regexes.  It would be a wonderful
: > feather in perl6's cap to be able to legitimately claim Level 2
: > or even Level 3 compliance, since besides perl5, only ICU right
: > now manages even Level 1, with everybody else *very* far behind.

Anyone who implements all of S05 can claim Level 2.

: I would like to have:
: - all Unicode features of Perl5 (UCD, charnames, Normalize, properties)
: - most of the features of ICU (e.g. transforms, localisation)
: - normalization form, locale support and tailored $features on string
:   level (_not_ lexical context)
: 
: This means that any string can be in, or can be transformed into, one
: of these forms:
: - Byte
: - NFD, NFC, NFKD, NFKC
: - NFG (Default Grapheme Clusters)
: - NFGT (Tailored Grapheme Clusters)
: 
: Tailored means that the grapheme form (NFG) needs a Language-Locale
: (e.g. German-Swiss-Spelling_1996, or German-Austrian-PhoneBookCollation).
: Without a Language-Locale, an NFG string is handled as a default NFG
: string.
: 
: Tailored also means that the user (Perl6 programmer) can tailor the
: relevant mechanisms (formatting, normalization, collation,
: properties, case folding etc.).

I think we're mostly on the same page, though I think we'll need
to thrash out how much of this info is carried in the string type,
and how much is implicit to the current language's view of strings
(where the programmer gets to choose which viewpoint to take, such as
"graphemes", "codepoints", "bytes", etc.).  We want to give access to
strings from other viewpoints to the extent that we can, but the current
viewpoint will tend to produce different default results than other
viewpoints.  By default, for instance, Perl 6 is supposed to be warped
toward NFG semantics, and wants to view all strings through that lens
unless otherwise instructed.  The "lesser" strings will tend to look
more like buffers to such a view.
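
The three viewpoints named above can be sketched in Python to show how the same string yields different lengths. This is a rough illustration only: `rough_grapheme_count` and `viewpoints` are invented names, UTF-8 is an assumed encoding for the byte view, and the grapheme count is a crude approximation (any codepoint with a mark category is attached to the preceding base), not the default grapheme cluster algorithm.

```python
import unicodedata


def rough_grapheme_count(text: str) -> int:
    """Count clusters, attaching any mark (category M*) to the preceding base.

    Approximation only: Hangul syllable sequences, ZWJ emoji sequences,
    CRLF and other cluster rules are ignored.
    """
    count = 0
    for ch in text:
        is_mark = unicodedata.category(ch).startswith("M")
        # A mark extends the current cluster; anything else starts a new one.
        if count == 0 or not is_mark:
            count += 1
    return count


def viewpoints(text: str) -> dict:
    """Return the length of text as seen from three different viewpoints."""
    return {
        "bytes": len(text.encode("utf-8")),
        "codepoints": len(text),
        "graphemes": rough_grapheme_count(text),
    }


sample = "e\u0301a"  # "e" + COMBINING ACUTE ACCENT, then "a"
print(viewpoints(sample))  # {'bytes': 4, 'codepoints': 3, 'graphemes': 2}
```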

: > TR18 specifies three levels of support (Basic, Extended, and Tailored),
: > with each having specific, reasonably well-defined requirements:
: 
: There is a lot of work in the Unicode standard - using it costs
: nothing, but saves time. E.g. the allowed characters for identifiers
: can be defined with the Unicode properties 'ID_Start' and
: 'ID_Continue', graphemes with Grapheme_Base, Grapheme_Extend etc.
: 
: >   =Level 1: Basic Unicode Support
: [...]
: >    RL1.3    Subtraction and Intersection
: 
: IMHO not complete

Fixed.  Now have |, &, and ^ with expected precedence, and () for bracketing.

: >    RL1.5    Simple Loose Matches
: 
: Hmm ...

Clarified.

: >    RL1.6    Line Boundaries
: 
: can be defined

Now we have an extensible boundary syntax with a different namespace than
normal rules.  <|w> is a word boundary, etc.

: >    RL1.7    Supplementary Code Points
: 
: IMHO not specced

Yes, already implicit to the NFG view of reality.

: >   =Level 2: Extended Unicode Support
: >    RL2.1    Canonical Equivalents
: 
: IMHO not specced

To me, this is also implied by NFG semantics.

: >    RL2.4    Default Loose Matches
: >    RL2.5    Name Properties
: >    RL2.6    Wildcard Properties
: 
: IMHO not specced

Was already specced as a "smartmatch", but I've made the name matching more 
explicit.

: >   =Level 3: Tailored Unicode Support
: 
: IMHO not specced

We've had :chars conjecturing tailoring for quite a while now.  Of course,
the actual syntax is negotiable.

: It would be easier to reference the appropriate chapters of the
: Unicode standard in the specification of Perl6. This would make
: Unicode test-cases reusable. An implementation should also always
: declare which Unicode features it implements (and which it does
: not), and for which version of Unicode.

We can certainly use a lot more work on the details, but the intent
should be clear that we want Perl 6 to support world-class Unicode.

Larry


Re: Perl6 regexes and UTS#18

2011-02-06 Thread Helmut Wollmersdorfer

Tom Christiansen wrote:
> Has anybody specifically looked at how Perl6 regexes might map to
> the various requirements of UTS#18, Unicode Regular Expressions?
>
> http://unicode.org/reports/tr18/

Roughly, Perl6 supports Level 2 (I did a quick check of UTS#18 and the specs).

> I ask because to my inexperienced eye, quite a few perl6isms are
> *much* better at this than in perl5 obtain, and so I wondered
> whether this was by conscious intent and design.  Is/Was it?

Seems intended.

> I'm also curious whether there are active plans to address the
> tr18 requirements in perl6 regexes.  It would be a wonderful
> feather in perl6's cap to be able to legitimately claim Level 2
> or even Level 3 compliance, since besides perl5, only ICU right
> now manages even Level 1, with everybody else *very* far behind.

I would like to have:
- all Unicode features of Perl5 (UCD, charnames, Normalize, properties)
- most of the features of ICU (e.g. transforms, localisation)
- normalization form, locale support and tailored $features on string
  level (_not_ lexical context)

This means that any string can be in, or can be transformed into, one
of these forms:
- Byte
- NFD, NFC, NFKD, NFKC
- NFG (Default Grapheme Clusters)
- NFGT (Tailored Grapheme Clusters)

Tailored means that the grapheme form (NFG) needs a Language-Locale
(e.g. German-Swiss-Spelling_1996, or German-Austrian-PhoneBookCollation).
Without a Language-Locale, an NFG string is handled as a default NFG
string.

Tailored also means that the user (Perl6 programmer) can tailor the
relevant mechanisms (formatting, normalization, collation, properties,
case folding etc.).


> TR18 specifies three levels of support (Basic, Extended, and Tailored),
> with each having specific, reasonably well-defined requirements:

There is a lot of work in the Unicode standard - using it costs nothing,
but saves time. E.g. the allowed characters for identifiers can be
defined with the Unicode properties 'ID_Start' and 'ID_Continue',
graphemes with Grapheme_Base, Grapheme_Extend etc.
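
As a hedged sketch of the identifier idea, the following Python approximates ID_Start and ID_Continue through general categories from the standard `unicodedata` module. It is an approximation by assumption: the Other_ID_Start and Other_ID_Continue additions and the Pattern_Syntax and Pattern_White_Space exclusions are left out, and the names `ID_START_CATEGORIES`, `ID_CONTINUE_CATEGORIES` and `looks_like_identifier` are invented for this example.

```python
import unicodedata

# General categories that make up most of ID_Start and ID_Continue.
# Approximation: the Other_ID_* additions and Pattern_* exclusions are omitted.
ID_START_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
ID_CONTINUE_CATEGORIES = ID_START_CATEGORIES | {"Mn", "Mc", "Nd", "Pc"}


def looks_like_identifier(name: str) -> bool:
    """Check name against the approximate ID_Start ID_Continue* pattern."""
    if not name:
        return False
    first, rest = name[0], name[1:]
    if unicodedata.category(first) not in ID_START_CATEGORIES:
        return False
    return all(unicodedata.category(ch) in ID_CONTINUE_CATEGORIES for ch in rest)


for candidate in ["straße", "größe2", "2fast", "a-b"]:
    print(candidate, looks_like_identifier(candidate))
# straße True
# größe2 True
# 2fast False
# a-b False
```

The rule is driven entirely by property data rather than a hand-written character list, which is what lets it follow new Unicode versions.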


>   =Level 1: Basic Unicode Support
[...]
>    RL1.3    Subtraction and Intersection

IMHO not complete

>    RL1.5    Simple Loose Matches

Hmm ...

>    RL1.6    Line Boundaries

can be defined

>    RL1.7    Supplementary Code Points

IMHO not specced

>   =Level 2: Extended Unicode Support
>    RL2.1    Canonical Equivalents

IMHO not specced

>    RL2.4    Default Loose Matches
>    RL2.5    Name Properties
>    RL2.6    Wildcard Properties


IMHO not specced

>   =Level 3: Tailored Unicode Support

IMHO not specced

It would be easier to reference the appropriate chapters of the Unicode
standard in the specification of Perl6. This would make Unicode
test-cases reusable. An implementation should also always declare which
Unicode features it implements (and which it does not), and for which
version of Unicode.


Helmut Wollmersdorfer



Perl6 regexes and UTS#18

2011-02-05 Thread Tom Christiansen
Has anybody specifically looked at how Perl6 regexes might map to
the various requirements of UTS#18, Unicode Regular Expressions?

http://unicode.org/reports/tr18/

I ask because to my inexperienced eye, quite a few perl6isms are
*much* better at this than in perl5 obtain, and so I wondered
whether this was by conscious intent and design.  Is/Was it?

I'm also curious whether there are active plans to address the
tr18 requirements in perl6 regexes.  It would be a wonderful
feather in perl6's cap to be able to legitimately claim Level 2
or even Level 3 compliance, since besides perl5, only ICU right
now manages even Level 1, with everybody else *very* far behind.

TR18 specifies three levels of support (Basic, Extended, and Tailored),
with each having specific, reasonably well-defined requirements:

  =Level 1: Basic Unicode Support
   RL1.1    Hex Notation
   RL1.2    Properties
   RL1.2a   Compatibility Properties
   RL1.3    Subtraction and Intersection
   RL1.4    Simple Word Boundaries
   RL1.5    Simple Loose Matches
   RL1.6    Line Boundaries
   RL1.7    Supplementary Code Points

  =Level 2: Extended Unicode Support
   RL2.1    Canonical Equivalents
   RL2.2    Default Grapheme Clusters
   RL2.3    Default Word Boundaries
   RL2.4    Default Loose Matches
   RL2.5    Name Properties
   RL2.6    Wildcard Properties

  =Level 3: Tailored Unicode Support
   RL3.1    Tailored Punctuation
   RL3.2    Tailored Grapheme Clusters
   RL3.3    Tailored Word Boundaries
   RL3.4    Tailored Loose Matches
   RL3.5    Tailored Ranges
   RL3.6    Context Matching
   RL3.7    Incremental Matches
 ( RL3.8    Unicode Set Sharing )
   RL3.9    Possible Match Sets
   RL3.10   Folded Matching
   RL3.11   Submatchers

thanks,

--tom