Re: Unicode in 'NFG' formation ?

2009-05-20 Thread Helmut Wollmersdorfer

Larry Wall wrote:

On Mon, May 18, 2009 at 11:11:32AM +0200, Helmut Wollmersdorfer wrote:



2) Can I use Unicode property matching safely with graphemes?
   If yes, who or what maintains the necessary tables?



Good question.  My assumption is that adding marks to a character
doesn't change its fundamental nature.  What needs to be provided
other than pass-through to the base character's properties?


This will work in most cases, but e.g. not with the property 
ASCII_Hex_Digit.


LATIN SMALL LETTER A is ASCII_Hex_Digit
but
GRAPHEME LATIN SMALL LETTER A WITH DOT BELOW AND DOT ABOVE is_not 
ASCII_Hex_Digit


I will try to generate some millions of cases based on nfc(nfd($string)) 
to find out the best inheritance rules.
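To make the ASCII_Hex_Digit case concrete, here is a minimal sketch in Python rather than Perl 6. Python's unicodedata module does not expose ASCII_Hex_Digit, so string.hexdigits stands in for the property table; the function names are hypothetical and only illustrate the two candidate inheritance rules.

```python
import string
import unicodedata


def base_char(grapheme: str) -> str:
    """Return the first code point of the canonically decomposed grapheme."""
    return unicodedata.normalize("NFD", grapheme)[0]


def is_ascii_hex_digit_passthrough(grapheme: str) -> bool:
    # Pass-through rule: ask only the base character.
    return base_char(grapheme) in string.hexdigits


def is_ascii_hex_digit_strict(grapheme: str) -> bool:
    # Strict rule: the grapheme must be exactly one ASCII hex digit.
    return len(grapheme) == 1 and grapheme in string.hexdigits


plain_a = "a"
dotted_a = "a\u0323\u0307"  # a + COMBINING DOT BELOW + COMBINING DOT ABOVE
```

Under the pass-through rule dotted_a would count as a hex digit; under the strict rule it would not, which is the behaviour argued for above.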


4) Should the definition of graphemes conform to Unicode Standard Annex  
#29 'grapheme clusters'? Which level - legacy, extended or tailored?



No opinion, other than that we're aiming for the most modern
formulation that doesn't implicitly cede declarational control to
something out of the control of Perl 6 declarations.  (See locales for
an example of something Perl 6 ignores in the absence of an explicit
declaration to pay attention to them.)  So just guessing from the
names without reading the Annex in question, not legacy, but probably
extended, with explicitly tailoring allowed by declaration.  (Unless
extended has some dire performance or policy consequences that would
be contraindicative...)


Will look into ICU what's supported.


So as long as we stay inside these fundamental Perl 6 design
principles, feel free to whack on the specs.


OK. Hopefully some Indic, Arabic and Asian natives review this.

Helmut Wollmersdorfer


Re: Unicode in 'NFG' formation ?

2009-05-20 Thread John M. Dlugosz

Larry Wall larry-at-wall.org |Perl 6| wrote:

On Mon, May 18, 2009 at 11:11:32AM +0200, Helmut Wollmersdorfer wrote:
  

[1] Open questions:

1) Will graphemes have a unique charname?
   e.g. GRAPHEME LATIN SMALL LETTER A WITH DOT BELOW AND DOT ABOVE



Yes, presumably that comes with the normalization part of NFG.
We're not aiming for round-tripping of synthetic codepoints, just
as NFC doesn't do round-tripping of sequences that have precomposed
codepoints.  We're really just extending the NFC notion a bit further
to encompass temporary precomposed codepoints.

  
Unique when asking for the name, not when specifying the name.  Just as 
with the code-point order, any combination that means the same should 
give the same grapheme, just as if you had created the code point 
sequence first.  Perhaps you are not realizing that the different 
classes of modifiers are independent.  You could say DOT ABOVE AND DOT 
BELOW and get the same thing as DOT BELOW AND DOT ABOVE.





2) Can I use Unicode property matching safely with graphemes?
   If yes, who or what maintains the necessary tables?



Good question.  My assumption is that adding marks to a character
doesn't change its fundamental nature.  What needs to be provided
other than pass-through to the base character's properties?

  
Depends on the property!  Being a modifier, for example.  A detailed 
look would be needed to decide which properties just pass through to the 
base char, which are enhanced (e.g. letter becomes letter with 
modifiers), which don't make sense, which are mostly OK but change 
sometimes, etc.





Re: Unicode in 'NFG' formation ?

2009-05-18 Thread Helmut Wollmersdorfer

John M. Dlugosz wrote:
I was going over S02, and found it opens with, "By default Perl presents 
Unicode in NFG formation, where each grapheme counts as one character."


I looked up NFG, and found it to be an invention of this group, but 
didn't find any details when I tried to chase down the links.


As Darren already wrote, the only definition is in 
http://docs.parrot.org/parrot/latest/html/docs/pdds/pdd28_strings.pod.html 
 which references 'Unicode Normalization Forms' 
http://www.unicode.org/reports/tr15/.


Also there is a reference to
"The Unicode Standard defines a grapheme cluster (commonly simplified to 
just grapheme)". IMHO the authors meant this document:


 Unicode Standard Annex #29
 Unicode Text Segmentation
 http://unicode.org/reports/tr29/

This opens a whole bunch of questions for me.  


I have many unanswered questions [1] about graphemes.

If you mean that the 
default for what the individual items in a string are is graphemes, OK, 
but what does that have to do with parsing source code?  


First - nothing. S01: Perl 6 is written in Unicode. Developers can 
choose one of the encodings (UTF-8, UTF-16 etc.) for files with Perl 
source code. Characters outside the ASCII range can be used for 
identifiers, literals, and syntactic punctuation (e.g. 'bracketing 
pairs').


It's a problem of the parser to handle it correctly.

Even so, that's 
not something that would be called a Normalization Form.


Not in Unicode, but it can be called Grapheme Composition.

Thus

\c[LATIN SMALL LETTER A, COMBINING DOT ABOVE, COMBINING DOT BELOW]
\c[LATIN SMALL LETTER A, COMBINING DOT BELOW, COMBINING DOT ABOVE]
\c[LATIN SMALL LETTER A WITH DOT ABOVE, COMBINING DOT BELOW]
\c[LATIN SMALL LETTER A WITH DOT BELOW, COMBINING DOT ABOVE]

should all lead to the same grapheme (my personal assumption).
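As a rough check of that assumption, here is a sketch in Python using standard Unicode normalization (NFC) as a stand-in for the normalization step of NFG; the helper name `named` is hypothetical. It builds the four spellings from their character names and normalizes each.

```python
import unicodedata


def named(*names: str) -> str:
    """Build a string from a sequence of Unicode character names."""
    return "".join(unicodedata.lookup(n) for n in names)


spellings = [
    named("LATIN SMALL LETTER A", "COMBINING DOT ABOVE", "COMBINING DOT BELOW"),
    named("LATIN SMALL LETTER A", "COMBINING DOT BELOW", "COMBINING DOT ABOVE"),
    named("LATIN SMALL LETTER A WITH DOT ABOVE", "COMBINING DOT BELOW"),
    named("LATIN SMALL LETTER A WITH DOT BELOW", "COMBINING DOT ABOVE"),
]

# All four code point sequences differ, but they are canonically equivalent,
# so normalization collapses them to one sequence.
normalized = {unicodedata.normalize("NFC", s) for s in spellings}
```

The set `normalized` ends up with a single member, so a grapheme table keyed on the NFC form would hand out one identity for all four spellings.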

Character set encodings and stuff is one of my strengths.  I'd like to 
straighten this out, and can certainly straighten out the wording, but 
first need to know what you meant by that.


What's specified:
1) A grapheme is 1 character, thus has 'length' 1.
2) A grapheme has a unique internal representation as an integer for 
some life-time (process), outside the Unicode codepoints.

3) Graphemes can be normalized to NFD, NFC etc.

[1] Open questions:

1) Will graphemes have a unique charname?
   e.g. GRAPHEME LATIN SMALL LETTER A WITH DOT BELOW AND DOT ABOVE
2) Can I use Unicode property matching safely with graphemes?
   If yes, who or what maintains the necessary tables?
3) Details of 'life-time', round-trip.
4) Should the definition of graphemes conform to Unicode Standard Annex 
#29 'grapheme clusters'? Which level - legacy, extended or tailored?


Helmut Wollmersdorfer




Re: Unicode in 'NFG' formation ?

2009-05-18 Thread Helmut Wollmersdorfer

Darren Duncan wrote:

Since you seem eager, I recommend you start with porting the Parrot PDD 
28 to a new Perl 6 Synopsis 15, and continue from there.


IMHO we need some people for a broad discussion on the details first.

Helmut Wollmersdorfer


Re: Unicode in 'NFG' formation ?

2009-05-18 Thread Mark J. Reed
Do we really need to be able to map arbitrary graphemes to integers,
or is it enough to have an opaque value returned by ord() that, when
fed to chr(), returns the same grapheme?  If the latter, a list of
code points (in one of the official Normalization Forms) would seem
to be sufficient.

On 5/18/09, Helmut Wollmersdorfer hel...@wollmersdorfer.at wrote:
 Darren Duncan wrote:

 Since you seem eager, I recommend you start with porting the Parrot PDD
 28 to a new Perl 6 Synopsis 15, and continue from there.

 IMHO we need some people for a broad discussion on the details first.

 Helmut Wollmersdorfer


-- 
Sent from my mobile device

Mark J. Reed markjr...@gmail.com


Re: Unicode in 'NFG' formation ?

2009-05-18 Thread Austin Hastings

If you haven't read the PDD, it's a good start.

To summarize, probably oversimplifying badly:

1. A grapheme is a character *as seen on the page.*  That is, if 
composing a + dot above + dot below produces an a with dots above 
and below it, then THAT is the grapheme.


2. Unicode has a lot of characters that are single code points 
representing a complex grapheme. For example, the A + ring above 
composition shows up as the Angstrom symbol.


3. But on the other hand, some combination of basic characters plus 
combining marks DO NOT have a single code point that represents them. 
For example, while your girlfriend might compose dotless lowercase i 
with combining heart above to produce an i with a heart instead of a 
dot, there isn't a single codepoint in Unicode for that. (Unless 
girly-grrls got their own code page. Maybe in Unicode 6...)


4. Since that's a considerable PITA to deal with, we now have NFG 
format, which really should have been called NFW format, IMO. (W = 
widechars, natch.) Every combination of basic plus combining marks 
*that gets used* will have a single grapheme allocated. Many of them, 
like the Angstrom symbol, or O + combining röckdöts, will already 
have a real unicode grapheme. The rest of them will get negative 
numbers assigned, one at a time. The negative numbers will only be 
meaningful to the string they're in, or maybe only to the particular 
execution context. (There are issues with comparing, etc. Which is why I 
think maybe one table per execution.)


5. The result is that every grapheme (letter-on-the-page) will have a 
single number behind it, will have a length of 1, etc. So we can do 
meaningful substr($str, 2, 7) and get what we expect, even when the 
fifth grapheme requires a base character plus 4 combining marks.
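The five points above can be sketched roughly as follows, in Python rather than Parrot or Perl 6. This is not the PDD 28 implementation: the class and function names are hypothetical, and the segmentation rule (a code point plus any following combining marks) is a deliberate simplification of UAX #29.

```python
import unicodedata


class GraphemeTable:
    """Maps multi-code-point graphemes to negative integers, one at a time."""

    def __init__(self):
        self._ids = {}      # NFC string -> negative id
        self._strings = []  # position (-id - 1) -> NFC string

    def intern(self, cluster: str) -> int:
        nfc = unicodedata.normalize("NFC", cluster)
        if len(nfc) == 1:
            return ord(nfc)  # already has a real Unicode code point
        if nfc not in self._ids:
            self._strings.append(nfc)
            self._ids[nfc] = -len(self._strings)
        return self._ids[nfc]

    def text(self, value: int) -> str:
        return chr(value) if value >= 0 else self._strings[-value - 1]


def clusters(s: str):
    """Simplified segmentation: a code point plus following combining marks."""
    current = ""
    for ch in s:
        if current and unicodedata.combining(ch) == 0:
            yield current
            current = ""
        current += ch
    if current:
        yield current


def to_nfg(s: str, table: GraphemeTable) -> list:
    return [table.intern(c) for c in clusters(s)]


table = GraphemeTable()
# x, A + ring above, y, a + dot above + dot below, z
nfg = to_nfg("xA\u030ay" + "a\u0307\u0323" + "z", table)
```

Here `nfg` has five entries, so slicing it behaves like substr on graphemes: A + ring above collapses to the existing code point U+00C5, while the doubly dotted a gets the first synthetic value, -1.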


All hail @Larry!

=Austin


Mark J. Reed wrote:

Do we really need to be able to map arbitrary graphemes to integers,
or is it enough to have an opaque value returned by ord() that, when
fed to chr(), returns the same grapheme?  If the latter, a list of
code points (in one of the official Normalization Forms) would seem
to be sufficient.

On 5/18/09, Helmut Wollmersdorfer hel...@wollmersdorfer.at wrote:
  

Darren Duncan wrote:



Since you seem eager, I recommend you start with porting the Parrot PDD
28 to a new Perl 6 Synopsis 15, and continue from there.
  

IMHO we need some people for a broad discussion on the details first.

Helmut Wollmersdorfer




  




Re: Unicode in 'NFG' formation ?

2009-05-18 Thread Mark J. Reed
On Mon, May 18, 2009 at 9:11 AM, Austin Hastings
austin_hasti...@yahoo.com wrote:
 If you haven't read the PDD, it's a good start.

snip useful summary

I get all that, really.  I still question the necessity of mapping
each grapheme to a single integer.  A single *value*, sure.
length($weird_grapheme) should always be 1, absolutely.  But why does
ord($weird_grapheme) have to be a *numeric* value?  If you convert to,
say, normalization form C and return a list of the scalar values so
obtained, that can be used in any context to reproduce the same
grapheme, with no worries about different processes coming up with
different assignments of arbitrary negative numbers to graphemes.

If you're doing arithmetic with the code points or scalar values of
characters, then the specific numbers would seem to matter.  I'm
looking for the use case where the fact that it's an integer matters
but the specific value doesn't.
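The alternative described here, an ord() that returns the list of scalar values in normalization form C, can be sketched in Python as below. The names `grapheme_ord` and `grapheme_chr` are hypothetical and are not proposed Perl 6 API.

```python
import unicodedata


def grapheme_ord(grapheme: str) -> tuple:
    """Return the NFC scalar values of a grapheme as a hashable tuple."""
    return tuple(ord(c) for c in unicodedata.normalize("NFC", grapheme))


def grapheme_chr(values) -> str:
    """Rebuild the grapheme from the values returned by grapheme_ord."""
    return "".join(chr(v) for v in values)


weird = "a\u0307\u0323"        # a + dot above + dot below
value = grapheme_ord(weird)    # same in every process, no table needed
```

The result depends only on the Unicode data, not on the order in which a process happened to meet its graphemes, which is the portability property argued for above. The cost is that the value is no longer fixed size.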

-- 
Mark J. Reed markjr...@gmail.com


Re: Unicode in 'NFG' formation ?

2009-05-18 Thread Austin Hastings

Mark J. Reed wrote:

On Mon, May 18, 2009 at 9:11 AM, Austin Hastings
austin_hasti...@yahoo.com wrote:
  

If you haven't read the PDD, it's a good start.



snip useful summary

I get all that, really.  I still question the necessity of mapping
each grapheme to a single integer.  A single *value*, sure.
length($weird_grapheme) should always be 1, absolutely.  But why does
ord($weird_grapheme) have to be a *numeric* value?  If you convert to,
say, normalization form C and return a list of the scalar values so
obtained, that can be used in any context to reproduce the same
grapheme, with no worries about different processes coming up with
different assignments of arbitrary negative numbers to graphemes.

If you're doing arithmetic with the code points or scalar values of
characters, then the specific numbers would seem to matter.  I'm
looking for the use case where the fact that it's an integer matters
but the specific value doesn't.

  


There's a couple of cases. First of all, it doesn't have to be an 
integer. It needs to be a fixed size, and it needs to be orderable, so 
that we can store a bunch of them in an intelligent fashion, thus making 
it easy to sort them.


With that said, integers meet the need exactly. Plus, there's the 
benefit that unicode already has an escape hatch built in to it for 
user-defined stuff. And that escape hatch is an integer.


The benefits are documented in the pod: they're fixed size, so we can 
scan over them forward and backward at low cost. They're easily 
distinguished (high bit set) so string code can special-case them 
quickly. They're orderable, comparable, etc. And best of all they 
contain no trans fat!


=Austin




Re: Unicode in 'NFG' formation ?

2009-05-18 Thread Brandon S. Allbery KF8NH

On May 18, 2009, at 09:21 , Mark J. Reed wrote:

If you're doing arithmetic with the code points or scalar values of
characters, then the specific numbers would seem to matter.  I'm



I would argue that if you are working with a grapheme cluster  
(grapheme), arithmetic on individual grapheme values is undefined.   
What is the meaning of ord(\c[LATIN LETTER T WITH DOT ABOVE, COMBINING  
DOT BELOW]) + 1?  If you say it increments the base character (a  
reasonable-looking initial stance), what happens if I add an amount  
which changes the base character to a combining character?  And what  
happens if the original grapheme doesn't have a base character?


In short, I think the only remotely sane result of ord() on a grapheme  
is an opaque value meaningful to chr() but to very little, if  
anything, else.  If you want to represent it as an integer, fine, but  
it should be obscured such that math isn't possible on it.   
Conversely, if you want ord() values you can manipulate, you must work  
at the codepoint level.
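A small illustration of the distinction, in Python, where ord() happens to be defined only for a single code point. The particular code points are chosen for the example and are not taken from the thread.

```python
import unicodedata

base = "t"
cluster = "t\u0307\u0323"  # t + COMBINING DOT ABOVE + COMBINING DOT BELOW

# Arithmetic at the code point level is well defined.
next_letter = chr(ord(base) + 1)

# Python refuses ord() on a multi-code-point cluster outright.
try:
    ord(cluster)
    cluster_ord_defined = True
except TypeError:
    cluster_ord_defined = False

# Adding 1 can turn a non-combining character into a combining one:
# U+02FF is a spacing modifier letter, U+0300 is COMBINING GRAVE ACCENT.
before = chr(0x02FF)
after = chr(ord(before) + 1)
```

The last two lines show the hazard raised above: incrementing a value can move it from a base-like character into the combining marks, so "increment the base character" has no obvious meaning for a cluster.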


--
brandon s. allbery [solaris,freebsd,perl,pugs,haskell] allb...@kf8nh.com
system administrator [openafs,heimdal,too many hats] allb...@ece.cmu.edu
electrical and computer engineering, carnegie mellon university    KF8NH






Re: Unicode in 'NFG' formation ?

2009-05-18 Thread Larry Wall
On Mon, May 18, 2009 at 12:37:49PM -0400, Brandon S. Allbery KF8NH wrote:
 On May 18, 2009, at 09:21 , Mark J. Reed wrote:
 If you're doing arithmetic with the code points or scalar values of
 characters, then the specific numbers would seem to matter.  I'm


 I would argue that if you are working with a grapheme cluster  
 (grapheme), arithmetic on individual grapheme values is undefined.   
 What is the meaning of ord(\c[LATIN LETTER T WITH DOT ABOVE, COMBINING  
 DOT BELOW]) + 1?  If you say it increments the base character (a  
 reasonable-looking initial stance), what happens if I add an amount  
 which changes the base character to a combining character?  And what  
 happens if the original grapheme doesn't have a base character?

 In short, I think the only remotely sane result of ord() on a grapheme  
 is an opaque value meaningful to chr() but to very little, if anything, 
 else.  If you want to represent it as an integer, fine, but it should be 
 obscured such that math isn't possible on it.  Conversely, if you want 
 ord() values you can manipulate, you must work at the codepoint level.

Sure, but this is a weak argument, since you can already write complete
ord/chr nonsense at the codepoint level (even in ASCII), and all we're
doing here is making graphemes work more like codepoints in terms of
storage and indexing.  If people abuse it, they have only themselves
to blame for relying on what is essentially an implementation detail.
The whole point of ord is to cheat, so if they get caught cheating,
well, they just have to take their lumps.  In the age of Unicode,
ord and chr are pretty much irrelevant to most normal text processing
anyway except for encoders and decoders, so there's not a great deal
of point in labeling the integers as an opaque type, in my opinion.

As an implementation detail however, it's important to note that
the signed/unsigned distinction gives us a great deal of latitude
in how to store a particular sequence of integers.  Latin-1 will (by
definition) fit in a *uint8, while ASCII plus (no more than 128) NFG
negatives will fit into *int8.  Most European languages will fit into
*int16 with up to 32768 synthetic chars.  Most Asian text still fits
into *uint16 as long as they don't synthesize codepoints.  And we can
always resort to *uint32 and *int32 knowing that the Unicode consortium
isn't going to use the top bit any time in the foreseeable future.
(Unless, of course, they endorse something resembling NFG. :)
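The storage choice described above can be sketched as a lookup over integer ranges. This is an illustration in Python, not how Parrot or any Perl 6 implementation does it; the function name and the order in which candidates are tried are assumptions.

```python
def narrowest_storage(values) -> str:
    """Pick the smallest integer type that holds every value in the string.

    Non-negative values are Unicode code points; negative values are
    synthetic (NFG) graphemes.
    """
    if not values:
        return "uint8"
    lo, hi = min(values), max(values)
    candidates = [
        ("uint8", 0, 2**8 - 1),        # Latin-1
        ("int8", -2**7, 2**7 - 1),     # ASCII plus up to 128 synthetics
        ("uint16", 0, 2**16 - 1),      # BMP text without synthetics
        ("int16", -2**15, 2**15 - 1),  # up to 32768 synthetics
        ("uint32", 0, 2**32 - 1),
        ("int32", -2**31, 2**31 - 1),
    ]
    for name, low, high in candidates:
        if low <= lo and hi <= high:
            return name
    raise OverflowError("value does not fit in 32 bits")
```

For example, the code points of "café" fit in uint8, but adding one synthetic grapheme (-1) to a string containing é pushes it to int16, since 233 does not fit in a signed byte.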

Note also that uint8 has nothing to do with UTF-8, and uint16 has
nothing to do with UTF-16.  Surrogate pairs are represented by a single
integer in NFG.  That is, NFG is always abstract codepoints of some
sort without regard to the underlying representation.  In that sense
it's not important that synthetic codepoints are negative, of course.

Larry


Re: Unicode in 'NFG' formation ?

2009-05-18 Thread Mark J. Reed
 On Mon, May 18, 2009 at 12:37:49PM -0400, Brandon S. Allbery KF8NH wrote:
 I would argue that if you are working with a grapheme cluster
 (grapheme), arithmetic on individual grapheme values is undefined.

Yup, that was exactly what I was arguing.

 In short, I think the only remotely sane result of ord() on a grapheme
 is an opaque value meaningful to chr() but to very little, if anything,
 else.

Which is what we have with the negative integer spec.  What I dislike
is the transient, handlish nature of those values: like a handle, you
can't store the value and then use it to reconstruct the grapheme
later.  But since actually storing the grapheme itself should be no
great feat, I guess that's not much of a hardship.

On Mon, May 18, 2009 at 1:37 PM, Larry Wall la...@wall.org wrote:
 you can already write complete ord/chr nonsense at the codepoint level (even 
 in ASCII)

Sorry, could you clarify what you mean by that?

 And we can always resort to *uint32 and *int32 knowing that the Unicode
 consortium isn't going to use the top bit any time in the foreseeable future.

s/top bit/top 11 bits/...

 Note also that uint8 has nothing to do with UTF-8, and uint16 has
 nothing to do with UTF-16.  Surrogate pairs are represented by a single
 integer in NFG.

They are also represented by a single value in UTF-8; that is, the
full scalar value is encoded directly, rather than being first encoded into
UTF-16 surrogates which are then encoded as UTF-8...
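Both points can be checked with a few lines of Python; the choice of U+1F600 as the sample supplementary character is arbitrary.

```python
ch = "\U0001F600"  # a code point outside the Basic Multilingual Plane

# UTF-16 needs two 16-bit code units (a surrogate pair) for it.
raw16 = ch.encode("utf-16-be")
utf16_units = [int.from_bytes(raw16[i:i + 2], "big")
               for i in range(0, len(raw16), 2)]

# UTF-8 encodes the scalar value directly in four bytes, with no surrogates.
utf8_bytes = ch.encode("utf-8")

# The largest code point is U+10FFFF, which needs 21 bits, so a 32-bit
# integer has 11 bits to spare.
unused_bits = 32 - (0x10FFFF).bit_length()
```

The two UTF-16 units come out as D83D and DE00, the UTF-8 bytes as F0 9F 98 80, and `unused_bits` as 11, which matches the s/top bit/top 11 bits/ correction.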

 That is, NFG is always abstract codepoints of some sort

Barely-relevant terminology nit: "abstract code points" sounds like
something that would be associated with "abstract characters", which
as defined in Unicode are formally distinct from graphemes, which is
what we're talking about here.

Also, the term "code points" includes the surrogates, which can only
appear in UTF-16; I imagine the scalar values we deal with most of the
time at the character/grapheme level would be the subset of code
points excluding surrogates, which are called "Unicode scalar values".

Surrogates are just weird, since they have assigned code points even
though they're purely an encoding mechanism.  As such, they straddle
the line between abstract characters and an encoding form. I assume
that if text comes in as UTF-16, the surrogates will disappear as far
as character-level P6 code is concerned.  So is there any way for P6
to manipulate surrogates as characters?  Maybe an adverb or trait?
Or does one have to descend to the bytewise layer for that?  (As you
said, that *normally* shouldn't be necessary outside encoding and
decoding, where you need to do things bytewise anyway; just trying to
cover all the bases...)
-- 
Mark J. Reed markjr...@gmail.com


Re: Unicode in 'NFG' formation ?

2009-05-18 Thread Larry Wall
On Mon, May 18, 2009 at 11:11:32AM +0200, Helmut Wollmersdorfer wrote:
 [1] Open questions:

 1) Will graphemes have a unique charname?
e.g. GRAPHEME LATIN SMALL LETTER A WITH DOT BELOW AND DOT ABOVE

Yes, presumably that comes with the normalization part of NFG.
We're not aiming for round-tripping of synthetic codepoints, just
as NFC doesn't do round-tripping of sequences that have precomposed
codepoints.  We're really just extending the NFC notion a bit further
to encompass temporary precomposed codepoints.

 2) Can I use Unicode property matching safely with graphemes?
If yes, who or what maintains the necessary tables?

Good question.  My assumption is that adding marks to a character
doesn't change its fundamental nature.  What needs to be provided
other than pass-through to the base character's properties?

 3) Details of 'life-time', round-trip.

Which is a very interesting topic, with connections to type theory,
scope/domain management, and security issues (such as the possibility
of a DoS attack on the translation tables).

 4) Should the definition of graphemes conform to Unicode Standard Annex  
 #29 'grapheme clusters'? Which level - legacy, extended or tailored?

No opinion, other than that we're aiming for the most modern
formulation that doesn't implicitly cede declarational control to
something out of the control of Perl 6 declarations.  (See locales for
an example of something Perl 6 ignores in the absence of an explicit
declaration to pay attention to them.)  So just guessing from the
names without reading the Annex in question, not legacy, but probably
extended, with explicitly tailoring allowed by declaration.  (Unless
extended has some dire performance or policy consequences that would
be contraindicative...)

So as long as we stay inside these fundamental Perl 6 design
principles, feel free to whack on the specs.

Larry


Re: Unicode in 'NFG' formation ?

2009-05-18 Thread Larry Wall
On Mon, May 18, 2009 at 02:16:17PM -0400, Mark J. Reed wrote:
: Surrogates are just weird, since they have assigned code points even
: though they're purely an encoding mechanism.  As such, they straddle
: the line between abstract characters and an encoding form. I assume
: that if text comes in as UTF-16, the surrogates will disappear as far
: as character-level P6 code is concerned.

I devoutly hope so.  UTF-8 is much cleaner than UTF-16 in this regard.
(And it's why I qualified my "code point" with "abstract" earlier, to
mean the UTF-8 interpretation rather than the UTF-16 interpretation.)

: So is there any way for P6
: to manipulate surrogates as characters?  Maybe an adverb or trait?
: Or does one have to descend to the bytewise layer for that?  (As you
: said, that *normally* shouldn't be necessary outside encoding and
: decoding, where you need to do things bytewise anyway; just trying to
: cover all the bases...)

Buf16 should work for raw UTF-16 just fine.  That's one of the main
reasons we have buffers in sizes other than 8, after all.

Larry


Re: Unicode in 'NFG' formation ?

2009-05-18 Thread Brandon S. Allbery KF8NH

On May 18, 2009, at 14:16 , Larry Wall wrote:

On Mon, May 18, 2009 at 11:11:32AM +0200, Helmut Wollmersdorfer wrote:

3) Details of 'life-time', round-trip.


Which is a very interesting topic, with connections to type theory,
scope/domain management, and security issues (such as the possibility
of a DoS attack on the translation tables).


I find myself wondering if they might need to be standardized anyway;  
specifically I'm contemplating Erlang-style services.


--
brandon s. allbery [solaris,freebsd,perl,pugs,haskell] allb...@kf8nh.com
system administrator [openafs,heimdal,too many hats] allb...@ece.cmu.edu
electrical and computer engineering, carnegie mellon university    KF8NH






Re: Unicode in 'NFG' formation ?

2009-05-18 Thread Austin Hastings

Brandon S. Allbery KF8NH wrote:

On May 18, 2009, at 14:16 , Larry Wall wrote:

On Mon, May 18, 2009 at 11:11:32AM +0200, Helmut Wollmersdorfer wrote:

3) Details of 'life-time', round-trip.


Which is a very interesting topic, with connections to type theory,
scope/domain management, and security issues (such as the possibility
of a DoS attack on the translation tables).


I find myself wondering if they might need to be standardized anyway; 
specifically I'm contemplating Erlang-style services.


Why wouldn't a marshalling of an NFG string automatically include the 
grapheme table? That way you can realize it and immediately use it in 
fast mode. Alternatively, if you were providing a persistent string 
service, a post-marshalling step could re-normalize it in local NFG.


The response in NFG could either use the same table you sent (if the 
response is a subset of the original string) or could attach its own 
table for translation at your end.
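A sketch of that marshalling idea in Python, with JSON as an arbitrary wire format. The function names, the payload layout, and the dict-based tables are all assumptions made for illustration.

```python
import json


def marshal(values, table) -> str:
    """Serialize an NFG string together with the table entries it uses.

    values: list of ints (negative = synthetic grapheme)
    table:  dict mapping negative id -> code point string
    """
    used = {str(v): table[v] for v in set(values) if v < 0}
    return json.dumps({"values": values, "table": used})


def unmarshal(payload: str, local_table: dict) -> list:
    """Re-map the sender's synthetic ids into the receiver's own table.

    local_table: dict mapping code point string -> negative id; it is
    extended in place when an unknown grapheme arrives.
    """
    data = json.loads(payload)
    remote = {int(k): s for k, s in data["table"].items()}
    out = []
    for v in data["values"]:
        if v >= 0:
            out.append(v)  # real code points need no translation
            continue
        s = remote[v]
        if s not in local_table:
            local_table[s] = -(len(local_table) + 1)
        out.append(local_table[s])
    return out


sender_table = {-1: "\u1ea1\u0307"}
payload = marshal([120, -1], sender_table)

receiver_table = {"q\u0323\u0307": -1}  # receiver already used -1 for something else
received = unmarshal(payload, receiver_table)
```

The sender's -1 arrives as -2 on the receiving side, because the receiver had already assigned -1 to a different grapheme. This is the "re-normalize it in local NFG" step; skipping it and using the attached table directly would correspond to the fast mode.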


=Austin



Re: Unicode in 'NFG' formation ?

2009-05-18 Thread Austin Hastings

Larry Wall wrote:

Which is a very interesting topic, with connections to type theory,
scope/domain management, and security issues (such as the possibility
of a DoS attack on the translation tables).
  


I think that a DoS attack on Unicode would be called IBM/Windows Code 
Pages. The rest of the world have been suffering this attack for the 
last 40 years. I'm not sure anyone would notice, at this point. :-)


Re: Unicode in 'NFG' formation ?

2009-05-18 Thread John M. Dlugosz

Mark J. Reed markjreed-at-gmail.com |Perl 6| wrote:

On Mon, May 18, 2009 at 9:11 AM, Austin Hastings
austin_hasti...@yahoo.com wrote:
  

If you haven't read the PDD, it's a good start.



snip useful summary

I get all that, really.  I still question the necessity of mapping
each grapheme to a single integer.  A single *value*, sure.
length($weird_grapheme) should always be 1, absolutely.  But why does
ord($weird_grapheme) have to be a *numeric* value?  If you convert to,
say, normalization form C and return a list of the scalar values so
obtained, that can be used in any context to reproduce the same
grapheme, with no worries about different processes coming up with
different assignments of arbitrary negative numbers to graphemes.
  


My feelings, in general.  It appears that the concept of mapping total 
graphemes to integers, negative, etc. is an implementation decision.  
Perl 6 strings have a concept of graphemes, and functions that work with 
them.  But the core language specification should keep that as general 
as possible, and allow implementation freedom.  The statement that "base 
moda modb" produces the same grapheme value as "base modb moda" is at 
the correct level.  The statement "the grapheme is an Int" is not only 
at the wrong level, but not right, as they should be their own distinct 
type.  I think that the PDD details of assigning negative values as 
encountered AND the idea of being a list of code points in some 
normalized form, AND the idea of it being a buffer of bytes in UTF8 with 
that list of code points encoded therein, are all *allowed* as correct 
implementations.  So is having a type whose instance data stores it in 
however many forms it wants, and for the Perl end of things you just let 
the === operator take its natural course.



If you're doing arithmetic with the code points or scalar values of
characters, then the specific numbers would seem to matter.  I'm
looking for the use case where the fact that it's an integer matters
but the specific value doesn't.



Well, you can view a string as bytes of UTF8, code points, or 
graphemes.  If you want numbers you probably wanted the first two.  A 
grapheme object should in some ways behave as a string of 1 grapheme and 
allow you to obtain bytes of UTF8 or code points, easily. 

Now object identity, the address of an object, is not mandated to be 
an Int or even numeric.  Different types can return different things 
even.  The only thing we know is that infix:<===> uses them.


Should graphemes be any different?  A grapheme object has observed 
behavior (encode it as...) and internal unobserved behavior.  Perhaps 
we need more assertions such as saying that it can serve as hash keys 
properly, rather than going all the way to saying that they must be 
numbered.  Especially with an internal numbering system that changes 
from run to run!
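One way to picture a grapheme as its own type, with identity defined by behaviour rather than by a number, is the Python sketch below. The class name and its methods are hypothetical, and storing the NFC form is just one of the permitted internal representations mentioned above.

```python
import unicodedata
from dataclasses import dataclass


@dataclass(frozen=True)
class Grapheme:
    """A single grapheme whose equality and hash come from its NFC form."""
    nfc: str

    @classmethod
    def of(cls, text: str) -> "Grapheme":
        return cls(unicodedata.normalize("NFC", text))

    def codepoints(self) -> list:
        return [ord(c) for c in self.nfc]

    def utf8(self) -> bytes:
        return self.nfc.encode("utf-8")


g1 = Grapheme.of("a\u0307\u0323")  # base moda modb
g2 = Grapheme.of("a\u0323\u0307")  # base modb moda

counts = {}
for g in (g1, g2, Grapheme.of("x")):
    counts[g] = counts.get(g, 0) + 1
```

The two spellings compare equal and land on the same hash key, and the code points or UTF-8 bytes are available on request, all without the type promising to be an integer.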


Meanwhile... that's what the Str class does.  It still has nothing to do 
with how source code is parsed.  To that extent, mentioning it in S02, 
at least in that section, is a mistake.  A see-also to general Perl 
Unicode documentation would not be objectionable.


Also, I described more detailed, formal handling of the input stream to 
the Perl 6 parser last year:  http://www.dlugosz.com/Perl6/specdoc.pdf 
in Section 3.1.  It was discussed on this mailing list when I was 
starting it.


--John



Re: Unicode in 'NFG' formation ?

2009-05-18 Thread John M. Dlugosz

Larry Wall larry-at-wall.org |Perl 6| wrote:

Sure, but this is a weak argument, since you can already write complete
ord/chr nonsense at the codepoint level (even in ASCII), and all we're
doing here is making graphemes work more like codepoints in terms of
storage and indexing.  If people abuse it, they have only themselves
to blame for relying on what is essentially an implementation detail.
The whole point of ord is to cheat, so if they get caught cheating,
well, they just have to take their lumps.  In the age of Unicode,
ord and chr are pretty much irrelevant to most normal text processing
anyway except for encoders and decoders, so there's not a great deal
of point in labeling the integers as an opaque type, in my opinion.

  



Playing the Devil's Advocate here, some other discussion on this thread 
made me think of something.  People already write code that expects 
ord's to be ordered.  Instead of saying, "well, use code points if you 
want to do that", we can encourage people to embrace graphemes and say 
"don't use code points or bytes!  Use graphemes!" if they behave in a 
familiar enough manner.


So on one hand I say "viva la revolution!": graphemes are modeled after 
the object identity, which is totally opaque except for equality 
testing.  But on the other hand, I want to say "they may be funky 
inside, but you can still _use_ them in the ways you want..." and ensure 
that they work as hash keys and are not only ordered but include ASCII 
ordering as a subgroup.  But, still not disallow any good implementation 
ideas that befit totally different implementations.


Of course, that's not a problem unique to graphemes.  The object 
identity keys, for example.  Any forward-thinking that replaces old 
values with magic cookies.  Perhaps we need a general class that will 
assign orderable tags to arbitrary values and remember the mapping, and 
use that for more general cases.  It can be explicitly specialized to 
use any implementation-dependent ordering that actually exists on that 
type, and the general case would just be to memo-ize an int mapping.
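Such a general class might look roughly like this in Python. The class name and the `native` parameter are hypothetical; the sketch only shows the shape of the idea, with memoized integers in the general case and an optional type-specific ordering as the specialization.

```python
class OrderableTags:
    """Assigns orderable integer tags to arbitrary hashable values."""

    def __init__(self, native=None):
        # Optional specialization: a function value -> int for types that
        # already have an implementation-dependent ordering.
        self._native = native
        self._tags = {}

    def tag(self, value) -> int:
        if self._native is not None:
            return self._native(value)
        # General case: memoize an int in first-seen order.
        return self._tags.setdefault(value, len(self._tags))


general = OrderableTags()
first = general.tag("x")
second = general.tag("y")
again = general.tag("x")

specialized = OrderableTags(native=ord)
```

In the general case the tags are stable within one run and can serve as sort keys, for example `sorted(items, key=general.tag)`, but they carry no meaning across runs. The specialized instance simply defers to an ordering the type already has.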


--John


Re: Unicode in 'NFG' formation ?

2009-05-18 Thread John M. Dlugosz

Larry Wall larry-at-wall.org |Perl 6| wrote:

into *uint16 as long as they don't synthesize codepoints.  And we can
always resort to *uint32 and *int32 knowing that the Unicode consortium
isn't going to use the top bit any time in the foreseeable future.
(Unless, of course, they endorse something resembling NFG. :)
  


No, a few million code points in the Unicode standard can produce an 
arbitrary number of unique grapheme clusters, since you can apply as 
many modifiers as you like to each different base character.  If you 
allow multiples, the total is unbounded.


A small program, which ought to go into the test suite g, can generate 
4G distinct grapheme clusters, one at a time. 

How many implementations will that break?  If they want fixed size, 
64-bits should do for now.  Also, if the spec doesn't list a requirement 
for a minimum implementation limit, *any* fixed-size implementation will 
be incorrect even if untestable as such.
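A sketch of such a program in Python. The choice of four combining marks and the generator name are arbitrary; the point is only that the supply of distinct clusters never runs out.

```python
import itertools
import unicodedata

# Four combining marks that share one combining class, so canonical
# reordering never makes two different sequences equivalent.
MARKS = ["\u0300", "\u0301", "\u0302", "\u0303"]


def distinct_clusters(base: str = "a"):
    """Yield an endless stream of distinct grapheme clusters, one at a time."""
    for n in itertools.count(1):
        for combo in itertools.product(MARKS, repeat=n):
            yield base + "".join(combo)


sample = list(itertools.islice(distinct_clusters(), 1000))
distinct_after_nfc = {unicodedata.normalize("NFC", s) for s in sample}
```

With 4 marks there are 4**n clusters carrying exactly n marks, so 16 marks on one base already give 4**16 = 2**32 possibilities. An implementation that hands out one table entry per cluster seen would exhaust any fixed-size id space if fed this stream long enough.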


--John



Re: Unicode in 'NFG' formation ?

2009-05-18 Thread Larry Wall
On Mon, May 18, 2009 at 07:59:31PM -0500, John M. Dlugosz wrote:
 No, a few million code points in the Unicode standard can produce an  
 arbitrary number of unique grapheme clusters, since you can apply as  
 many modifiers as you like to each different base character.  If you  
 allow multiples, the total is unbounded.

 A small program, which ought to go into the test suite g, can generate  
 4G distinct grapheme clusters, one at a time. 

That precise behavior is what I was characterizing as a DoS attack.  :)

So in my head it falls into the Doctor-it-hurts-when-I-do-this category.

Larry


Re: Unicode in 'NFG' formation ?

2009-05-18 Thread Brandon S. Allbery KF8NH

On May 18, 2009, at 21:54 , Larry Wall wrote:

On Mon, May 18, 2009 at 07:59:31PM -0500, John M. Dlugosz wrote:

No, a few million code points in the Unicode standard can produce an
arbitrary number of unique grapheme clusters, since you can apply as
many modifiers as you like to each different base character.  If you
allow multiples, the total is unbounded.

A small program, which ought to go into the test suite g, can generate
4G distinct grapheme clusters, one at a time.


That precise behavior is what I was characterizing as a DoS attack.  :)

So in my head it falls into the Doctor-it-hurts-when-I-do-this category.



If you're working with externally generated Unicode, you may not have  
that option.  I've gotten some bizarre combinations out of Word in  
Hebrew with nikudot, then saved as UTF8 text (so bizarre, in fact,  
that in the end I used gedit on FreeBSD).


--
brandon s. allbery [solaris,freebsd,perl,pugs,haskell] allb...@kf8nh.com
system administrator [openafs,heimdal,too many hats] allb...@ece.cmu.edu
electrical and computer engineering, carnegie mellon university    KF8NH






Re: Unicode in 'NFG' formation ?

2009-05-16 Thread Darren Duncan

John M. Dlugosz wrote:
I was going over S02, and found it opens with, "By default Perl presents 
Unicode in NFG formation, where each grapheme counts as one character."


I looked up NFG, and found it to be an invention of this group, but 
didn't find any details when I tried to chase down the links.


This opens a whole bunch of questions for me.  If you mean that the 
default for what the individual items in a string are is graphemes, OK, 
but what does that have to do with parsing source code?  Even so, that's 
not something that would be called a Normalization Form.


Character set encodings and stuff is one of my strengths.  I'd like to 
straighten this out, and can certainly straighten out the wording, but 
first need to know what you meant by that.


Can someone catch me up on the particulars?


I noticed and asked about this a few months ago.  As you say, NFG was invented 
for Perl 6 and/or Parrot.


See http://docs.parrot.org/parrot/latest/html/docs/pdds/pdd28_strings.pod.html 
for all the formal details that exist to my knowledge.


Back at the time I raised the issue, it was said that we need to take that 
Parrot PDD 28 and derive the initial Perl 6 Synopsis 15 from it.  Such a 
Synopsis could basically just start out as a clone of the Parrot document.  I 
said that someday I might have the round-tuit for this, but as yet I didn't.


Since you seem eager, I recommend you start with porting the Parrot PDD 28 to a 
new Perl 6 Synopsis 15, and continue from there.


-- Darren Duncan