Re: Unicode in 'NFG' formation ?

2009-05-18 Thread Helmut Wollmersdorfer

John M. Dlugosz wrote:
I was going over S02, and found it opens with, "By default Perl presents 
Unicode in NFG formation, where each grapheme counts as one character."


I looked up NFG, and found it to be an invention of this group, but 
didn't find any details when I tried to chase down the links.


As Darren already wrote, the only definition is in 
http://docs.parrot.org/parrot/latest/html/docs/pdds/pdd28_strings.pod.html 
 which references 'Unicode Normalization Forms' 
http://www.unicode.org/reports/tr15/.


Also there is a reference to
"The Unicode Standard defines a grapheme cluster (commonly simplified to 
just grapheme)". IMHO the authors meant this document:


 Unicode Standard Annex #29
 Unicode Text Segmentation
 http://unicode.org/reports/tr29/

This opens a whole bunch of questions for me.  


I have many unanswered questions [1] about graphemes.

If you mean that the 
default for what the individual items in a string are is graphemes, OK, 
but what does that have to do with parsing source code?  


First - nothing. S01: Perl 6 is written in Unicode. Developers can 
choose one of the encodings (UTF-8, UTF-16 etc.) for files with Perl 
source code. Characters outside the ASCII range can be used for 
identifiers, literals, and syntactic punctuation (e.g. 'bracketing 
pairs').


It's a problem of the parser to handle it correctly.

Even so, that's 
not something that would be called a Normalization Form.


Not in Unicode, but it can be called Grapheme Composition.

Thus

\c[LATIN SMALL LETTER A, COMBINING DOT ABOVE, COMBINING DOT BELOW]
\c[LATIN SMALL LETTER A, COMBINING DOT BELOW, COMBINING DOT ABOVE]
\c[LATIN SMALL LETTER A WITH DOT ABOVE, COMBINING DOT BELOW]
\c[LATIN SMALL LETTER A WITH DOT BELOW, COMBINING DOT ABOVE]

should all lead to the same grapheme (my personal assumption).
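
As a rough check of that assumption, here is a minimal Python sketch (Python rather than Perl 6, because the NFG behaviour was still unspecified; only the standard `unicodedata` module is used). It normalizes the four spellings and counts the distinct results. Canonical equivalence under NFC and NFD is standard Unicode behaviour; treating the result as one NFG grapheme is the assumption under discussion. The names `SPELLINGS` and `canonical_forms` are made up for this illustration.

```python
import unicodedata

# The four spellings from the message above, written with Unicode character names.
SPELLINGS = [
    "a\N{COMBINING DOT ABOVE}\N{COMBINING DOT BELOW}",
    "a\N{COMBINING DOT BELOW}\N{COMBINING DOT ABOVE}",
    "\N{LATIN SMALL LETTER A WITH DOT ABOVE}\N{COMBINING DOT BELOW}",
    "\N{LATIN SMALL LETTER A WITH DOT BELOW}\N{COMBINING DOT ABOVE}",
]

def canonical_forms(strings, form="NFC"):
    """Return the set of distinct normalized forms of the given strings."""
    return {unicodedata.normalize(form, s) for s in strings}

# If the four spellings are canonically equivalent, each set has one member.
print(len(canonical_forms(SPELLINGS, "NFC")))
print(len(canonical_forms(SPELLINGS, "NFD")))
```

Normalizing first and comparing afterwards is what lets all four spellings share one grapheme identity, whatever internal representation is chosen later.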

Character set encodings and stuff is one of my strengths.  I'd like to 
straighten this out, and can certainly straighten out the wording, but 
first need to know what you meant by that.


What's specified:
1) A grapheme is 1 character, thus has 'length' 1.
2) A grapheme has a unique internal representation as an integer for 
some life-time (process), outside the Unicode codepoints.

3) Graphemes can be normalized to NFD, NFC etc.

[1] Open questions:

1) Will graphemes have a unique charname?
   e.g. GRAPHEME LATIN SMALL LETTER A WITH DOT BELOW AND DOT ABOVE
2) Can I use Unicode property matching safely with graphemes?
   If yes, who or what maintains the necessary tables?
3) Details of 'life-time', round-trip.
4) Should the definition of graphemes conform to Unicode Standard Annex 
#29 'grapheme clusters'? Which level - legacy, extended or tailored?


Helmut Wollmersdorfer




Re: Unicode in 'NFG' formation ?

2009-05-18 Thread Helmut Wollmersdorfer

Darren Duncan wrote:

Since you seem eager, I recommend you start with porting the Parrot PDD 
28 to a new Perl 6 Synopsis 15, and continue from there.


IMHO we need some people for a broad discussion on the details first.

Helmut Wollmersdorfer


Re: Unicode in 'NFG' formation ?

2009-05-18 Thread Mark J. Reed
Do we really need to be able to map arbitrary graphemes to integers,
or is it enough to have an opaque value returned by ord() that, when
fed to chr(), returns the same grapheme?  If the latter, a list of
code points (in one of the official Normalization Forms) would seem
to be sufficient.


Mark J. Reed markjr...@gmail.com


Re: Unicode in 'NFG' formation ?

2009-05-18 Thread Austin Hastings

If you haven't read the PDD, it's a good start.

To summarize, probably oversimplifying badly:

1. A grapheme is a character *as seen on the page.*  That is, if 
composing a + dot above + dot below produces an a with dots above 
and below it, then THAT is the grapheme.


2. Unicode has a lot of characters that are single code points 
representing a complex grapheme. For example, the A + ring above 
composition shows up as the Angstrom symbol.


3. But on the other hand, some combination of basic characters plus 
combining marks DO NOT have a single code point that represents them. 
For example, while your girlfriend might compose dotless lowercase i 
with combining heart above to produce an i with a heart instead of a 
dot, there isn't a single codepoint in Unicode for that. (Unless 
girly-grrls got their own code page. Maybe in Unicode 6...)


4. Since that's a considerable PITA to deal with, we now have NFG 
format, which really should have been called NFW format, IMO. (W = 
widechars, natch.) Every combination of basic plus combining marks 
*that gets used* will have a single grapheme allocated. Many of them, 
like the Angstrom symbol, or O + combining röckdöts, will already 
have a real unicode grapheme. The rest of them will get negative 
numbers assigned, one at a time. The negative numbers will only be 
meaningful to the string they're in, or maybe only to the particular 
execution context. (There are issues with comparing, etc. Which is why I 
think maybe one table per execution.)


5. The result is that every grapheme (letter-on-the-page) will have a 
single number behind it, will have a length of 1, etc. So we can do 
meaningful substr($str, 2, 7) and get what we expect, even when the 
fifth grapheme requires a base character plus 4 combining marks.


All hail @Larry!

=Austin
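
The five points above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the Parrot implementation: the class name `GraphemeTable` and its methods are hypothetical, and a "grapheme" is approximated as a base character plus any following characters with a non-zero canonical combining class, which is much cruder than UAX #29 segmentation.

```python
import unicodedata

class GraphemeTable:
    """Toy NFG-style table: one integer per grapheme, negative if synthetic."""

    def __init__(self):
        self._ids = {}       # NFC cluster string -> negative id
        self._clusters = []  # cluster for id -1 is at index 0, -2 at 1, ...

    def _intern(self, cluster):
        if len(cluster) == 1:
            return ord(cluster)          # a real code point already exists
        if cluster not in self._ids:     # allocate the next negative number
            self._clusters.append(cluster)
            self._ids[cluster] = -len(self._clusters)
        return self._ids[cluster]

    def encode(self, text):
        """Return one integer per (approximate) grapheme of text."""
        ids, cluster = [], ""
        for ch in unicodedata.normalize("NFC", text):
            if cluster and unicodedata.combining(ch) == 0:
                ids.append(self._intern(cluster))  # new base: close the cluster
                cluster = ch
            else:
                cluster += ch
        if cluster:
            ids.append(self._intern(cluster))
        return ids

    def decode(self, ids):
        """Turn the integers back into (NFC) text using this table."""
        return "".join(chr(i) if i >= 0 else self._clusters[-i - 1] for i in ids)

table = GraphemeTable()
ids = table.encode("q\N{COMBINING DOT ABOVE}\N{COMBINING DOT BELOW}bc")
print(len(ids), ids[0] < 0)
```

Because every grapheme is one fixed-size integer, slicing the list `ids` behaves like the substr() call in point 5. The negative numbers are only meaningful together with the table that issued them.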


Re: Unicode in 'NFG' formation ?

2009-05-18 Thread Mark J. Reed
On Mon, May 18, 2009 at 9:11 AM, Austin Hastings
austin_hasti...@yahoo.com wrote:
 If you haven't read the PDD, it's a good start.

snip useful summary

I get all that, really.  I still question the necessity of mapping
each grapheme to a single integer.  A single *value*, sure.
length($weird_grapheme) should always be 1, absolutely.  But why does
ord($weird_grapheme) have to be a *numeric* value?  If you convert to,
say, normalization form C and return a list of the scalar values so
obtained, that can be used in any context to reproduce the same
grapheme, with no worries about different processes coming up with
different assignments of arbitrary negative numbers to graphemes.

If you're doing arithmetic with the code points or scalar values of
characters, then the specific numbers would seem to matter.  I'm
looking for the use case where the fact that it's an integer matters
but the specific value doesn't.
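
A minimal Python sketch of the alternative described above, in which the "ord" of a grapheme is the list of its NFC scalar values rather than one integer. The function names `grapheme_ord` and `grapheme_chr` are hypothetical stand-ins, not proposed Perl 6 API.

```python
import unicodedata

def grapheme_ord(grapheme):
    """Return the NFC scalar values of one grapheme as a tuple of ints."""
    return tuple(ord(c) for c in unicodedata.normalize("NFC", grapheme))

def grapheme_chr(scalars):
    """Rebuild the grapheme from the tuple produced by grapheme_ord."""
    return "".join(chr(cp) for cp in scalars)

value = grapheme_ord("q\N{COMBINING DOT ABOVE}\N{COMBINING DOT BELOW}")
print(grapheme_chr(value) == grapheme_chr(grapheme_ord(grapheme_chr(value))))
```

The tuple is the same in every process, so it can be stored and reused later. The cost is that it is not a fixed-size value.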

-- 
Mark J. Reed markjr...@gmail.com


Re: Unicode in 'NFG' formation ?

2009-05-18 Thread Austin Hastings

Mark J. Reed wrote:

On Mon, May 18, 2009 at 9:11 AM, Austin Hastings
austin_hasti...@yahoo.com wrote:
  

If you haven't read the PDD, it's a good start.



snip useful summary

I get all that, really.  I still question the necessity of mapping
each grapheme to a single integer.  A single *value*, sure.
length($weird_grapheme) should always be 1, absolutely.  But why does
ord($weird_grapheme) have to be a *numeric* value?  If you convert to,
say, normalization form C and return a list of the scalar values so
obtained, that can be used in any context to reproduce the same
grapheme, with no worries about different processes coming up with
different assignments of arbitrary negative numbers to graphemes.

If you're doing arithmetic with the code points or scalar values of
characters, then the specific numbers would seem to matter.  I'm
looking for the use case where the fact that it's an integer matters
but the specific value doesn't.

  


There's a couple of cases. First of all, it doesn't have to be an 
integer. It needs to be a fixed size, and it needs to be orderable, so 
that we can store a bunch of them in an intelligent fashion, thus making 
it easy to sort them.


With that said, integers meet the need exactly. Plus, there's the 
benefit that unicode already has an escape hatch built in to it for 
user-defined stuff. And that escape hatch is an integer.


The benefits are documented in the pod: they're fixed size, so we can 
scan over them forward and backward at low cost. They're easily 
distinguished (high bit set) so string code can special-case them 
quickly. They're orderable, comparable, etc. And best of all they 
contain no trans fat!


=Austin




Re: r26868 - docs/Perl6/Spec

2009-05-18 Thread Larry Wall
On Mon, May 18, 2009 at 07:01:27AM +0200, pugs-comm...@feather.perl6.nl wrote:
: Author: jdlugosz
: Date: 2009-05-18 07:01:27 +0200 (Mon, 18 May 2009)
: New Revision: 26868
: 
: Modified:
:docs/Perl6/Spec/S03-operators.pod
: Log:
: Fix one typo, s/know/known/.  Really just low-hanging fruit to test my Commit 
access and procedures therein.  I'm assuming that the VERSION block is updated 
manually before checking in, and all versions are numbered sequentially even if 
a typographic change.

It's fine to change the version on a typo, but no big deal if you
forget, and sometimes I forget on purpose if it's right after the
original checkin that introduced the typo, especially if it's my
own typo.  :)

Larry


Re: is value trait

2009-05-18 Thread Larry Wall
On Sun, May 17, 2009 at 09:35:50PM +0200, Moritz Lenz wrote:
: Hi,
: 
: t/oo/value_types.t mentions the is value trait, which doesn't appear
: in the spec anywhere. According to the discussion in [1] there was
: speculation about 'is cow' and 'is value', but the former didn't seem to
: enter the spec either.
: 
: So what should I do about that test? Simply delete it?

Yes, unless someone can think of a reason not to.

Larry


Re: Unicode in 'NFG' formation ?

2009-05-18 Thread Brandon S. Allbery KF8NH

On May 18, 2009, at 09:21 , Mark J. Reed wrote:

If you're doing arithmetic with the code points or scalar values of
characters, then the specific numbers would seem to matter.  I'm



I would argue that if you are working with a grapheme cluster  
(grapheme), arithmetic on individual grapheme values is undefined.   
What is the meaning of ord(\c[LATIN LETTER T WITH DOT ABOVE, COMBINING  
DOT BELOW]) + 1?  If you say it increments the base character (a  
reasonable-looking initial stance), what happens if I add an amount  
which changes the base character to a combining character?  And what  
happens if the original grapheme doesn't have a base character?


In short, I think the only remotely sane result of ord() on a grapheme  
is an opaque value meaningful to chr() but to very little, if  
anything, else.  If you want to represent it as an integer, fine, but  
it should be obscured such that math isn't possible on it.   
Conversely, if you want ord() values you can manipulate, you must work  
at the codepoint level.


--
brandon s. allbery [solaris,freebsd,perl,pugs,haskell] allb...@kf8nh.com
system administrator [openafs,heimdal,too many hats] allb...@ece.cmu.edu
electrical and computer engineering, carnegie mellon university    KF8NH






Re: Unicode in 'NFG' formation ?

2009-05-18 Thread Larry Wall
On Mon, May 18, 2009 at 12:37:49PM -0400, Brandon S. Allbery KF8NH wrote:
 On May 18, 2009, at 09:21 , Mark J. Reed wrote:
 If you're doing arithmetic with the code points or scalar values of
 characters, then the specific numbers would seem to matter.  I'm


 I would argue that if you are working with a grapheme cluster  
 (grapheme), arithmetic on individual grapheme values is undefined.   
 What is the meaning of ord(\c[LATIN LETTER T WITH DOT ABOVE, COMBINING  
 DOT BELOW]) + 1?  If you say it increments the base character (a  
 reasonable-looking initial stance), what happens if I add an amount  
 which changes the base character to a combining character?  And what  
 happens if the original grapheme doesn't have a base character?

 In short, I think the only remotely sane result of ord() on a grapheme  
 is an opaque value meaningful to chr() but to very little, if anything, 
 else.  If you want to represent it as an integer, fine, but it should be 
 obscured such that math isn't possible on it.  Conversely, if you want 
 ord() values you can manipulate, you must work at the codepoint level.

Sure, but this is a weak argument, since you can already write complete
ord/chr nonsense at the codepoint level (even in ASCII), and all we're
doing here is making graphemes work more like codepoints in terms of
storage and indexing.  If people abuse it, they have only themselves
to blame for relying on what is essentially an implementation detail.
The whole point of ord is to cheat, so if they get caught cheating,
well, they just have to take their lumps.  In the age of Unicode,
ord and chr are pretty much irrelevant to most normal text processing
anyway except for encoders and decoders, so there's not a great deal
of point in labeling the integers as an opaque type, in my opinion.

As an implementation detail however, it's important to note that
the signed/unsigned distinction gives us a great deal of latitude
in how to store a particular sequence of integers.  Latin-1 will (by
definition) fit in a *uint8, while ASCII plus (no more than 128) NFG
negatives will fit into *int8.  Most European languages will fit into
*int16 with up to 32768 synthetic chars.  Most Asian text still fits
into *uint16 as long as they don't synthesize codepoints.  And we can
always resort to *uint32 and *int32 knowing that the Unicode consortium
isn't going to use the top bit any time in the foreseeable future.
(Unless, of course, they endorse something resembling NFG. :)
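
The storage latitude described above can be sketched in Python. This is an illustration only: the table `STORAGE_TYPES` and the function `narrowest_storage` are hypothetical names, and the order in which candidate types are tried is an assumption, not something stated in the thread.

```python
# Candidate element types, narrowest first: (name, minimum, maximum).
STORAGE_TYPES = [
    ("uint8", 0, 2**8 - 1),
    ("int8", -2**7, 2**7 - 1),
    ("uint16", 0, 2**16 - 1),
    ("int16", -2**15, 2**15 - 1),
    ("uint32", 0, 2**32 - 1),
    ("int32", -2**31, 2**31 - 1),
]

def narrowest_storage(ids):
    """Pick the first listed type that can hold every grapheme integer."""
    if not ids:
        return "uint8"
    lo, hi = min(ids), max(ids)
    for name, minimum, maximum in STORAGE_TYPES:
        if minimum <= lo and hi <= maximum:
            return name
    raise ValueError("value does not fit in 32 bits")

print(narrowest_storage([ord(c) for c in "caf\u00e9"]))  # Latin-1 text
print(narrowest_storage([-1, ord("a")]))                 # ASCII plus one synthetic
```

Negative (synthetic) values force a signed type, which halves the positive range available at each width.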

Note also that uint8 has nothing to do with UTF-8, and uint16 has
nothing to do with UTF-16.  Surrogate pairs are represented by a single
integer in NFG.  That is, NFG is always abstract codepoints of some
sort without regard to the underlying representation.  In that sense
it's not important that synthetic codepoints are negative, of course.

Larry


Re: Unicode in 'NFG' formation ?

2009-05-18 Thread Mark J. Reed
 On Mon, May 18, 2009 at 12:37:49PM -0400, Brandon S. Allbery KF8NH wrote:
 I would argue that if you are working with a grapheme cluster
 (grapheme), arithmetic on individual grapheme values is undefined.

Yup, that was exactly what I was arguing.

 In short, I think the only remotely sane result of ord() on a grapheme
 is an opaque value meaningful to chr() but to very little, if anything,
 else.

Which is what we have with the negative integer spec.  What I dislike
is the transient, handlish nature of those values: like a handle, you
can't store the value and then use it to reconstruct the grapheme
later.  But since actually storing the grapheme itself should be no
great feat, I guess that's not much of a hardship.

On Mon, May 18, 2009 at 1:37 PM, Larry Wall la...@wall.org wrote:
 you can already write complete ord/chr nonsense at the codepoint level (even 
 in ASCII)

Sorry, could you clarify what you mean by that?

 And we can always resort to *uint32 and *int32 knowing that the Unicode
 consortium isn't going to use the top bit any time in the foreseeable future.

s/top bit/top 11 bits/...
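
The arithmetic behind that correction, as a short Python check. The largest Unicode code point is U+10FFFF; the variable names are only for illustration.

```python
MAX_CODE_POINT = 0x10FFFF                    # highest code point Unicode defines
bits_needed = MAX_CODE_POINT.bit_length()    # bits required to store it
spare_bits = 32 - bits_needed                # unused high bits in a 32-bit cell
print(bits_needed, spare_bits)               # 21 11
```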

 Note also that uint8 has nothing to do with UTF-8, and uint16 has
 nothing to do with UTF-16.  Surrogate pairs are represented by a single
 integer in NFG.

They are also represented by a single value in UTF-8; that is, the
full scalar value is encoded directly, rather than being first encoded into
UTF-16 surrogates which are then encoded as UTF-8...
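
To make the difference concrete, here is a small Python sketch that encodes one character outside the Basic Multilingual Plane both ways. The helper name `utf16_units` is hypothetical; the codecs and `struct` calls are standard library.

```python
import struct

def utf16_units(text):
    """Return the 16-bit code units of text in UTF-16 (big-endian, no BOM)."""
    data = text.encode("utf-16-be")
    return list(struct.unpack(">%dH" % (len(data) // 2), data))

ch = "\U0001F600"                              # one scalar value, U+1F600
print(ch.encode("utf-8").hex())                # f09f9880
print([hex(u) for u in utf16_units(ch)])       # ['0xd83d', '0xde00']
```

UTF-8 spends four bytes on the scalar value itself, while UTF-16 splits it into a high and a low surrogate. At the character level both decode to the single value 0x1F600.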

 That is, NFG is always abstract codepoints of some sort

Barely-relevant terminology nit: "abstract code points" sounds like
something that would be associated with "abstract characters", which
as defined in Unicode are formally distinct from graphemes, which is
what we're talking about here.

Also, the term "code points" includes the surrogates, which can only
appear in UTF-16; I imagine the scalar values we deal with most of the
time at the character/grapheme level would be the subset of code
points excluding surrogates, which are called "Unicode scalar values".

Surrogates are just weird, since they have assigned code points even
though they're purely an encoding mechanism.  As such, they straddle
the line between abstract characters and an encoding form. I assume
that if text comes in as UTF-16, the surrogates will disappear as far
as character-level P6 code is concerned.  So is there any way for P6
to manipulate surrogates as characters?  Maybe an adverb or trait?
Or does one have to descend to the bytewise layer for that?  (As you
said, that *normally* shouldn't be necessary outside encoding and
decoding, where you need to do things bytewise anyway; just trying to
cover all the bases...)
-- 
Mark J. Reed markjr...@gmail.com


Re: Unicode in 'NFG' formation ?

2009-05-18 Thread Larry Wall
On Mon, May 18, 2009 at 11:11:32AM +0200, Helmut Wollmersdorfer wrote:
 [1] Open questions:

 1) Will graphemes have a unique charname?
e.g. GRAPHEME LATIN SMALL LETTER A WITH DOT BELOW AND DOT ABOVE

Yes, presumably that comes with the normalization part of NFG.
We're not aiming for round-tripping of synthetic codepoints, just
as NFC doesn't do round-tripping of sequences that have precomposed
codepoints.  We're really just extending the NFC notion a bit further
to encompass temporary precomposed codepoints.

 2) Can I use Unicode property matching safely with graphemes?
If yes, who or what maintains the necessary tables?

Good question.  My assumption is that adding marks to a character
doesn't change its fundamental nature.  What needs to be provided
other than pass-through to the base character's properties?
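
A minimal Python sketch of that pass-through idea, assuming the grapheme's properties are simply those of its base character. The function name `grapheme_category` is hypothetical, and the "base" is taken to be the first code point of the NFD form, which is a simplification.

```python
import unicodedata

def grapheme_category(grapheme):
    """Return the General_Category of the grapheme's base character."""
    decomposed = unicodedata.normalize("NFD", grapheme)
    return unicodedata.category(decomposed[0])  # raises IndexError if empty

print(grapheme_category("q\N{COMBINING DOT ABOVE}\N{COMBINING DOT BELOW}"))
```

No extra tables are needed under this assumption, because the lookup falls through to the existing Unicode data for the base character.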

 3) Details of 'life-time', round-trip.

Which is a very interesting topic, with connections to type theory,
scope/domain management, and security issues (such as the possibility
of a DoS attack on the translation tables).

 4) Should the definition of graphemes conform to Unicode Standard Annex  
 #29 'grapheme clusters'? Which level - legacy, extended or tailored?

No opinion, other than that we're aiming for the most modern
formulation that doesn't implicitly cede declarational control to
something out of the control of Perl 6 declarations.  (See locales for
an example of something Perl 6 ignores in the absence of an explicit
declaration to pay attention to them.)  So just guessing from the
names without reading the Annex in question, not legacy, but probably
extended, with explicitly tailoring allowed by declaration.  (Unless
extended has some dire performance or policy consequences that would
be contraindicative...)

So as long as we stay inside these fundamental Perl 6 design
principles, feel free to whack on the specs.

Larry


Re: Unicode in 'NFG' formation ?

2009-05-18 Thread Larry Wall
On Mon, May 18, 2009 at 02:16:17PM -0400, Mark J. Reed wrote:
: Surrogates are just weird, since they have assigned code points even
: though they're purely an encoding mechanism.  As such, they straddle
: the line between abstract characters and an encoding form. I assume
: that if text comes in as UTF-16, the surrogates will disappear as far
: as character-level P6 code is concerned.

I devoutly hope so.  UTF-8 is much cleaner than UTF-16 in this regard.
(And it's why I qualified my "code point" with "abstract" earlier, to
mean the UTF-8 interpretation rather than the UTF-16 interpretation.)

: So is there any way for P6
: to manipulate surrogates as characters?  Maybe an adverb or trait?
: Or does one have to descend to the bytewise layer for that?  (As you
: said, that *normally* shouldn't be necessary outside encoding and
: decoding, where you need to do things bytewise anyway; just trying to
: cover all the bases...)

Buf16 should work for raw UTF-16 just fine.  That's one of the main
reasons we have buffers in sizes other than 8, after all.

Larry


Re: each() comprehension

2009-05-18 Thread Larry Wall
On Sun, May 17, 2009 at 07:41:45PM +0200, Moritz Lenz wrote:
: Hi,
: 
: (sorry for yet another p6l email mentioning junctions; if they annoy you
: just ignore this mail :-)
: 
: while reviewing some tests I found the each() comprehension in S02
: that evaded my attention so far.
: 
: Do we really want to keep such a rather obscure syntactic
: transformation? I find an explicit grep much more readable; if we want
: it to work in a more general case, it might become some kind of junction
: that, on autothreading, keeps a mapping between the original item and
: the new value, and on collapse returns all items for which the new value
: is true. Something along these lines:
: 
:  g(f(each(1..3)))
: becomes
:  g(each(1 => f(1), 2 => f(2), 3 => f(3)))
: becomes
:  each(1 => g(f(1)), 2 => g(f(2)), 3 => g(f(3)))
: and on collapse returns
:  (1..3).grep:{g(f($_))};
: 
: IMHO this would DWIM more in arbitrary code than the special syntactic
: form envisioned

Feel free either to whack it out and/or install each() as a conjectural
mapping junction that may be deferred till post-6.0.0.
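
For readers who want to see the conjectured mapping junction in action, here is a rough Python model of the behaviour Moritz describes: remember the original items, thread functions over the current values, and on collapse return the originals whose value is true. The class `Each` and its methods are hypothetical and say nothing about how Perl 6 would implement it.

```python
class Each:
    """Toy model of a mapping junction with grep semantics on collapse."""

    def __init__(self, items, values=None):
        self.items = list(items)                      # original items
        self.values = list(items) if values is None else list(values)

    def apply(self, func):
        """Autothread func over the current values, keeping the originals."""
        return Each(self.items, [func(v) for v in self.values])

    def collapse(self):
        """Return the original items whose current value is true."""
        return [item for item, value in zip(self.items, self.values) if value]

print(Each([2, 3, 4]).apply(lambda v: v - 3).collapse())  # [2, 4]
```

This sketch keeps the original order. The conjectural spec text only promises an unordered collection.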

: Also this part of S02 is rather obscure, IMHO:
: 
:   In particular,
: 
: @result = each(@x) ~~ {...};
: 
:  is equivalent to
: 
: @result = @x.grep:{...};
: 
: Should it be @result = @x.grep:{ $_ ~~ ... } instead? Otherwise
: 
: 'each(@x) ~~ 1..3' would be transformed into '@x.grep:{1..3}', which
: would return the full list. (Or do adverbial blocks some magic smart
: matching that I'm not aware of?)

The grep itself does the smart matching:

@dogs = grep Dog, @mammals;
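
A loose Python analogy of a grep that smart-matches its first argument against each element. The dispatch rules in `smartmatch` are a simplified assumption (type, range, callable, then equality), not the Perl 6 smart-match table, and both function names are hypothetical.

```python
def smartmatch(value, pattern):
    """Very reduced stand-in for ~~: dispatch on the kind of pattern."""
    if isinstance(pattern, type):
        return isinstance(value, pattern)   # like: $value ~~ Dog
    if isinstance(pattern, range):
        return value in pattern             # like: $value ~~ 1..3
    if callable(pattern):
        return bool(pattern(value))         # like: $value ~~ {...}
    return value == pattern

def grep(pattern, items):
    """Keep the items that smart-match pattern, preserving order and duplicates."""
    return [item for item in items if smartmatch(item, pattern)]

print(grep(range(1, 4), [0, 1, 3, 5, 1]))   # [1, 3, 1]
```

Note that Python's `range(1, 4)` excludes its upper bound, so it corresponds to Perl 6's `1..3`.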

Larry


Re: Unicode in 'NFG' formation ?

2009-05-18 Thread Brandon S. Allbery KF8NH

On May 18, 2009, at 14:16 , Larry Wall wrote:

On Mon, May 18, 2009 at 11:11:32AM +0200, Helmut Wollmersdorfer wrote:

3) Details of 'life-time', round-trip.


Which is a very interesting topic, with connections to type theory,
scope/domain management, and security issues (such as the possibility
of a DoS attack on the translation tables).


I find myself wondering if they might need to be standardized anyway;  
specifically I'm contemplating Erlang-style services.


--
brandon s. allbery [solaris,freebsd,perl,pugs,haskell] allb...@kf8nh.com
system administrator [openafs,heimdal,too many hats] allb...@ece.cmu.edu
electrical and computer engineering, carnegie mellon university    KF8NH






Re: Unicode in 'NFG' formation ?

2009-05-18 Thread Austin Hastings

Brandon S. Allbery KF8NH wrote:

On May 18, 2009, at 14:16 , Larry Wall wrote:

On Mon, May 18, 2009 at 11:11:32AM +0200, Helmut Wollmersdorfer wrote:

3) Details of 'life-time', round-trip.


Which is a very interesting topic, with connections to type theory,
scope/domain management, and security issues (such as the possibility
of a DoS attack on the translation tables).


I find myself wondering if they might need to be standardized anyway; 
specifically I'm contemplating Erlang-style services.


Why wouldn't a marshalling of an NFG string automatically include the 
grapheme table? That way you can realize it and immediately use it in 
fast mode. Alternatively, if you were providing a persistent string 
service, a post-marshalling step could re-normalize it in local NFG.


The response in NFG could either use the same table you sent (if the 
response is a subset of the original string) or could attach its own 
table for translation at your end.
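
A minimal Python sketch of that marshalling idea, assuming the string travels as a list of integers plus the part of the grapheme table it actually uses. The JSON layout and the names `marshal_nfg` and `unmarshal_nfg` are made up for illustration.

```python
import json

def marshal_nfg(ids, clusters):
    """Serialize grapheme ids together with the synthetic entries they use.

    ids:      list of ints, negative for synthetic graphemes
    clusters: dict mapping each negative id to its code point string
    """
    used = {str(i): clusters[i] for i in ids if i < 0}
    return json.dumps({"ids": ids, "table": used})

def unmarshal_nfg(payload):
    """Rebuild plain text on the receiving side from the attached table."""
    data = json.loads(payload)
    table = data["table"]
    return "".join(chr(i) if i >= 0 else table[str(i)] for i in data["ids"])

sender_table = {-1: "q\N{COMBINING DOT BELOW}\N{COMBINING DOT ABOVE}"}
wire = marshal_nfg([-1, ord("b")], sender_table)
print(unmarshal_nfg(wire) == sender_table[-1] + "b")
```

The receiver can then re-normalize the text into its own local table, so the sender's negative numbers never need to agree with the receiver's.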


=Austin



Re: Unicode in 'NFG' formation ?

2009-05-18 Thread Austin Hastings

Larry Wall wrote:

Which is a very interesting topic, with connections to type theory,
scope/domain management, and security issues (such as the possibility
of a DoS attack on the translation tables).
  


I think that a DoS attack on Unicode would be called IBM/Windows Code 
Pages. The rest of the world have been suffering this attack for the 
last 40 years. I'm not sure anyone would notice, at this point. :-)


r26876 - docs/Perl6/Spec

2009-05-18 Thread pugs-commits
Author: moritz
Date: 2009-05-18 23:08:54 +0200 (Mon, 18 May 2009)
New Revision: 26876

Modified:
   docs/Perl6/Spec/S02-bits.pod
   docs/Perl6/Spec/S09-data.pod
Log:
[S02] get rid of the each() comprehension
[S09] document speculative each() junction with grep semantics

Modified: docs/Perl6/Spec/S02-bits.pod
===
--- docs/Perl6/Spec/S02-bits.pod	2009-05-18 18:22:24 UTC (rev 26875)
+++ docs/Perl6/Spec/S02-bits.pod	2009-05-18 21:08:54 UTC (rev 26876)
@@ -3564,32 +3564,6 @@
 
 =item *
 
-When evaluating chained operators, if an C<each()> occurs anywhere in that
-chain, the chain will be transformed first into a C<grep>.  That is,
-
-    for 0 <= each(@x) < all(@y) {...}
-
-becomes
-
-    for @x.grep:{ 0 <= $_ < all(@y) } {...}
-
-Because of this, the original ordering C<@x> is guaranteed to be
-preserved in the returned list, and duplicate elements in C<@x> are
-preserved as well.  In particular,
-
-    @result = each(@x) ~~ {...};
-
-is equivalent to
-
-    @result = @x.grep:{...};
-
-However, this I<each()> comprehension is strictly a syntactic transformation,
-so a list computed any other way will not trigger the rewrite:
-
-    @result = (@x = each(@y)) ~~ {...}; # not a comprehension
-
-=item *
-
 The C<|> prefix operator may be used to force capture context on its
 argument and I<also> defeat any scalar argument checking imposed by
 subroutine signature declarations.  Any resulting list arguments are

Modified: docs/Perl6/Spec/S09-data.pod
===
--- docs/Perl6/Spec/S09-data.pod	2009-05-18 18:22:24 UTC (rev 26875)
+++ docs/Perl6/Spec/S09-data.pod	2009-05-18 21:08:54 UTC (rev 26876)
@@ -1057,6 +1057,18 @@
 please limit use of junctions to situations where the eventual binding
 to a scalar formal parameter is clear.
 
+(Conjecture: in post-Perl 6.0.0 we might introduce an C<each()>
+junction which keeps track of its initial values, returning on collapse
+those initial values which transformed into a true value, for example
+
+    each(2, 3, 4) - 3
+
+would return an unordered collection consisting of 2 and 4, because
+C<2-3> and C<4-3> are True in boolean context, while C<3-3> is False.
+However it is not yet clear if we really want that, and if yes, in which
+context the collapse will occur).
+
+
 =head1 Parallelized parameters and autothreading
 
 Within the scope of a C<use autoindex> pragma (or equivalent, such as



Re: Unicode in 'NFG' formation ?

2009-05-18 Thread John M. Dlugosz

Mark J. Reed markjreed-at-gmail.com |Perl 6| wrote:

On Mon, May 18, 2009 at 9:11 AM, Austin Hastings
austin_hasti...@yahoo.com wrote:
  

If you haven't read the PDD, it's a good start.



snip useful summary

I get all that, really.  I still question the necessity of mapping
each grapheme to a single integer.  A single *value*, sure.
length($weird_grapheme) should always be 1, absolutely.  But why does
ord($weird_grapheme) have to be a *numeric* value?  If you convert to,
say, normalization form C and return a list of the scalar values so
obtained, that can be used in any context to reproduce the same
grapheme, with no worries about different processes coming up with
different assignments of arbitrary negative numbers to graphemes.
  


My feelings, in general.  It appears that the concept of mapping total 
graphemes to integers, negative, etc. is an implementation decision.  
Perl 6 strings have a concept of graphemes, and functions that work with 
them.  But the core language specification should keep that as general 
as possible, and allow implementation freedom.  The statement that "base 
moda modb" produces the same grapheme value as "base modb moda" is at 
the correct level.  The statement "the grapheme is an Int" is not only 
at the wrong level, but not right, as they should be their own distinct 
type.  I think that the PDD details of assigning negative values as 
encountered AND the idea of being a list of code points in some 
normalized form, AND the idea of it being a buffer of bytes in UTF8 with 
that list of code points encoded therein, are all *allowed* as correct 
implementations.  So is having a type whose instance data stores it in 
however many forms it wants, and for the Perl end of things you just let 
the === operator take its natural course.



If you're doing arithmetic with the code points or scalar values of
characters, then the specific numbers would seem to matter.  I'm
looking for the use case where the fact that it's an integer matters
but the specific value doesn't.



Well, you can view a string as bytes of UTF8, code points, or 
graphemes.  If you want numbers you probably wanted the first two.  A 
grapheme object should in some ways behave as a string of 1 grapheme and 
allow you to obtain bytes of UTF8 or code points, easily. 

Now object identity, the address of an object, is not mandated to be 
an Int or even numeric.  Different types can return different things 
even.  The only thing we know is that infix:<===> uses them.


Should graphemes be any different?  A grapheme object has observed 
behavior (encode it as...) and internal unobserved behavior.  Perhaps 
we need more assertions such as saying that it can serve as hash keys 
properly, rather than going all the way to saying that they must be 
numbered.  Especially with an internal numbering system that changes 
from run to run!
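
One way to picture a grapheme as its own distinct type, sketched in Python. The class `Grapheme` is hypothetical: it promises equality, usability as a hash key, and on-demand code points or UTF-8 bytes, but exposes no number of its own. Storing the NFC string internally is just one of the representations the message says should be allowed.

```python
import unicodedata

class Grapheme:
    """Opaque grapheme value: comparable and hashable, but not an integer."""

    __slots__ = ("_nfc",)

    def __init__(self, text):
        self._nfc = unicodedata.normalize("NFC", text)  # internal detail

    def __eq__(self, other):
        return isinstance(other, Grapheme) and self._nfc == other._nfc

    def __hash__(self):
        return hash(self._nfc)

    def codepoints(self):
        """Observed behaviour: the grapheme as a list of code points."""
        return [ord(c) for c in self._nfc]

    def utf8(self):
        """Observed behaviour: the grapheme encoded as UTF-8 bytes."""
        return self._nfc.encode("utf-8")

a = Grapheme("q\N{COMBINING DOT ABOVE}\N{COMBINING DOT BELOW}")
b = Grapheme("q\N{COMBINING DOT BELOW}\N{COMBINING DOT ABOVE}")
print(a == b, {a: "seen"}[b])
```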


Meanwhile... that's what the Str class does.  It still has nothing to do 
with how source code is parsed.  To that extent, mentioning it in S02, 
at least in that section, is a mistake.  A see-also to general Perl 
Unicode documentation would not be objectionable.


Also, I described more detailed, formal handling of the input stream to 
the Perl 6 parser last year:  http://www.dlugosz.com/Perl6/specdoc.pdf 
in Section 3.1.  It was discussed on this mailing list when I was 
starting it.


--John



Re: Unicode in 'NFG' formation ?

2009-05-18 Thread John M. Dlugosz

Larry Wall larry-at-wall.org |Perl 6| wrote:

Sure, but this is a weak argument, since you can already write complete
ord/chr nonsense at the codepoint level (even in ASCII), and all we're
doing here is making graphemes work more like codepoints in terms of
storage and indexing.  If people abuse it, they have only themselves
to blame for relying on what is essentially an implementation detail.
The whole point of ord is to cheat, so if they get caught cheating,
well, they just have to take their lumps.  In the age of Unicode,
ord and chr are pretty much irrelevant to most normal text processing
anyway except for encoders and decoders, so there's not a great deal
of point in labeling the integers as an opaque type, in my opinion.

  



Playing the Devil's Advocate here, some other discussion on this thread 
made me think of something.  People already write code that expects 
ord's to be ordered.  Instead of saying, "well, use code points if you 
want to do that", we can encourage people to embrace graphemes and say 
"don't use code points or bytes!  Use graphemes!" if they behave in a 
familiar enough manner.


So on one hand I say "viva la revolution!": graphemes are modeled after 
the object identity, which is totally opaque except for equality 
testing.  But on the other hand, I want to say "they may be funky 
inside, but you can still _use_ them in the ways you want", and assure 
that they work as hash keys and are not only ordered but include ASCII 
ordering as a subgroup.  But, still not disallow any good implementation 
ideas that befit totally different implementations.


Of course, that's not a problem unique to graphemes.  The object 
identity keys, for example.  Any forward-thinking that replaces old 
values with magic cookies.  Perhaps we need a general class that will 
assign orderable tags to arbitrary values and remember the mapping, and 
use that for more general cases.  It can be explicitly specialized to 
use any implementation-dependent ordering that actually exists on that 
type, and the general case would just be to memo-ize an int mapping.
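
The general case mentioned above, memoizing an integer mapping for arbitrary values, could look roughly like this in Python. The class name `Tagger` and its methods are hypothetical; tags here follow first-seen order, which is an assumption and not a meaningful ordering of the values themselves.

```python
class Tagger:
    """Assign small orderable integer tags to hashable values and remember them."""

    def __init__(self):
        self._tags = {}     # value -> tag
        self._values = []   # tag -> value

    def tag(self, value):
        """Return the tag for value, allocating the next integer if it is new."""
        if value not in self._tags:
            self._tags[value] = len(self._values)
            self._values.append(value)
        return self._tags[value]

    def value(self, tag):
        """Return the value that was given this tag."""
        return self._values[tag]

tagger = Tagger()
print(tagger.tag("alpha"), tagger.tag("beta"), tagger.tag("alpha"))  # 0 1 0
```

A specialization for a type with a natural ordering would replace the first-seen counter with that ordering.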


--John


Re: Unicode in 'NFG' formation ?

2009-05-18 Thread John M. Dlugosz

Larry Wall larry-at-wall.org |Perl 6| wrote:

into *uint16 as long as they don't synthesize codepoints.  And we can
always resort to *uint32 and *int32 knowing that the Unicode consortium
isn't going to use the top bit any time in the foreseeable future.
(Unless, of course, they endorse something resembling NFG. :)
  


No, a few million code points in the Unicode standard can produce an 
arbitrary number of unique grapheme clusters, since you can apply as 
many modifiers as you like to each different base character.  If you 
allow multiples, the total is unbounded.


A small program, which ought to go into the test suite <g>, can generate 
4G distinct grapheme clusters, one at a time. 

How many implementations will that break?  If they want fixed size, 
64-bits should do for now.  Also, if the spec doesn't list a requirement 
for a minimum implementation limit, *any* fixed-size implementation will 
be incorrect even if untestable as such.
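
A small Python sketch of the kind of generator described above, assuming repeated combining marks are allowed. The function name `distinct_clusters` is hypothetical, and only the first thousand clusters are produced here rather than 4G.

```python
import itertools
import unicodedata

def distinct_clusters(base="q", mark="\N{COMBINING ACUTE ACCENT}"):
    """Yield an endless series of distinct clusters: base + 1, 2, 3... marks."""
    for count in itertools.count(1):
        yield base + mark * count

sample = list(itertools.islice(distinct_clusters(), 1000))
normalized = {unicodedata.normalize("NFC", cluster) for cluster in sample}
print(len(normalized))               # 1000: normalization merges none of them

# A signed 32-bit cell has 2**31 negative values; 4G clusters would need 2**32.
print(2**32 // 2**31)                # 2
```

Each cluster differs in length from every other, so a table that allocates one synthetic number per cluster grows without bound on such input.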


--John



Re: Unicode in 'NFG' formation ?

2009-05-18 Thread Larry Wall
On Mon, May 18, 2009 at 07:59:31PM -0500, John M. Dlugosz wrote:
 No, a few million code points in the Unicode standard can produce an  
 arbitrary number of unique grapheme clusters, since you can apply as  
 many modifiers as you like to each different base character.  If you  
 allow multiples, the total is unbounded.

 A small program, which ought to go into the test suite <g>, can generate  
 4G distinct grapheme clusters, one at a time. 

That precise behavior is what I was characterizing as a DoS attack.  :)

So in my head it falls into the Doctor-it-hurts-when-I-do-this category.

Larry


Re: Unicode in 'NFG' formation ?

2009-05-18 Thread Brandon S. Allbery KF8NH

On May 18, 2009, at 21:54 , Larry Wall wrote:

On Mon, May 18, 2009 at 07:59:31PM -0500, John M. Dlugosz wrote:

No, a few million code points in the Unicode standard can produce an
arbitrary number of unique grapheme clusters, since you can apply as
many modifiers as you like to each different base character.  If you
allow multiples, the total is unbounded.

A small program, which ought to go into the test suite <g>, can generate
4G distinct grapheme clusters, one at a time.


That precise behavior is what I was characterizing as a DoS attack.  :)

So in my head it falls into the Doctor-it-hurts-when-I-do-this category.



If you're working with externally generated Unicode, you may not have  
that option.  I've gotten some bizarre combinations out of Word in  
Hebrew with nikudot, then saved as UTF8 text (so bizarre, in fact,  
that in the end I used gedit on FreeBSD).


--
brandon s. allbery [solaris,freebsd,perl,pugs,haskell] allb...@kf8nh.com
system administrator [openafs,heimdal,too many hats] allb...@ece.cmu.edu
electrical and computer engineering, carnegie mellon university    KF8NH



