Re: Unicode in 'NFG' formation ?
John M. Dlugosz wrote:
> I was going over S02, and found it opens with, "By default Perl presents Unicode in NFG formation, where each grapheme counts as one character."
> I looked up NFG, and found it to be an invention of this group, but didn't find any details when I tried to chase down the links.

As Darren already wrote, the only definition is in
http://docs.parrot.org/parrot/latest/html/docs/pdds/pdd28_strings.pod.html
which references 'Unicode Normalization Forms' http://www.unicode.org/reports/tr15/.

Also there is a reference to "The Unicode Standard defines a grapheme cluster (commonly simplified to just grapheme)". IMHO the authors meant this document: Unicode Standard Annex #29, Unicode Text Segmentation, http://unicode.org/reports/tr29/

> This opens a whole bunch of questions for me.

I have many unanswered questions [1] about graphemes.

> If you mean that the default for what the individual items in a string are is graphemes, OK, but what does that have to do with parsing source code?

First - nothing. S01: Perl 6 is written in Unicode. Developers can choose one of the encodings (UTF-8, UTF-16 etc.) for files with Perl source code. Characters outside the ASCII range can be used for identifiers, literals, and syntactic punctuation (e.g. 'bracketing pairs'). It's the parser's problem to handle it correctly.

> Even so, that's not something that would be called a Normalization Form.

Not in Unicode, but it can be called Grapheme Composition. Thus

\c[LATIN SMALL LETTER A, COMBINING DOT ABOVE, COMBINING DOT BELOW]
\c[LATIN SMALL LETTER A, COMBINING DOT BELOW, COMBINING DOT ABOVE]
\c[LATIN SMALL LETTER A WITH DOT ABOVE, COMBINING DOT BELOW]
\c[LATIN SMALL LETTER A WITH DOT BELOW, COMBINING DOT ABOVE]

should all lead to the same grapheme (my personal assumption).

> Character set encodings and stuff is one of my strengths. I'd like to straighten this out, and can certainly straighten out the wording, but first need to know what you meant by that.

What's specified:
1) A grapheme is 1 character, thus has 'length' 1.
2) A grapheme has a unique internal representation as an integer for some life-time (process), outside the Unicode codepoints.
3) Graphemes can be normalized to NFD, NFC etc.

[1] Open questions:
1) Will graphemes have a unique charname? e.g. GRAPHEME LATIN SMALL LETTER A WITH DOT BELOW AND DOT ABOVE
2) Can I use Unicode property matching safely with graphemes? If yes, who or what maintains the necessary tables?
3) Details of 'life-time', round-trip.
4) Should the definition of graphemes conform to Unicode Standard Annex #29 'grapheme clusters'? Which level - legacy, extended or tailored?

Helmut Wollmersdorfer
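The assumption above, that the four \c[...] spellings name one grapheme, can at least be checked at the level of Unicode canonical equivalence. The following is a minimal sketch in Python (not Perl 6), using the standard `unicodedata` module; the names `SPELLINGS` and `canonical_forms` are made up for this illustration, and canonical equivalence is only assumed to be what NFG would build on.

```python
import unicodedata

A, DOT_ABOVE, DOT_BELOW = "a", "\u0307", "\u0323"
A_DOT_ABOVE = "\u0227"  # LATIN SMALL LETTER A WITH DOT ABOVE
A_DOT_BELOW = "\u1ea1"  # LATIN SMALL LETTER A WITH DOT BELOW

# The four spellings listed in the message, in the same order.
SPELLINGS = [
    A + DOT_ABOVE + DOT_BELOW,
    A + DOT_BELOW + DOT_ABOVE,
    A_DOT_ABOVE + DOT_BELOW,
    A_DOT_BELOW + DOT_ABOVE,
]

def canonical_forms(spellings, form="NFC"):
    """Return the set of distinct strings left after normalization."""
    return {unicodedata.normalize(form, s) for s in spellings}
```

If `canonical_forms(SPELLINGS)` has exactly one member, the four spellings are canonically equivalent. Note that the NFC result is still two code points long, which is the gap a grapheme-level representation is meant to close.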
Re: Unicode in 'NFG' formation ?
Darren Duncan wrote: Since you seem eager, I recommend you start with porting the Parrot PDD 28 to a new Perl 6 Synopsis 15, and continue from there. IMHO we need some people for a broad discussion on the details first. Helmut Wollmersdorfer
Re: Unicode in 'NFG' formation ?
Do we really need to be able to map arbitrary graphemes to integers, or is it enough to have an opaque value returned by ord() that, when fed to chr(), returns the same grapheme? If the latter, a list of code points (in one of the official Normalization Forms) would seem to be sufficient.

On 5/18/09, Helmut Wollmersdorfer hel...@wollmersdorfer.at wrote:
> Darren Duncan wrote:
>> Since you seem eager, I recommend you start with porting the Parrot PDD 28 to a new Perl 6 Synopsis 15, and continue from there.
> IMHO we need some people for a broad discussion on the details first.
> Helmut Wollmersdorfer

--
Sent from my mobile device
Mark J. Reed markjr...@gmail.com
Re: Unicode in 'NFG' formation ?
If you haven't read the PDD, it's a good start. To summarize, probably oversimplifying badly:

1. A grapheme is a character *as seen on the page.* That is, if composing a + dot above + dot below produces an a with dots above and below it, then THAT is the grapheme.

2. Unicode has a lot of characters that are single code points representing a complex grapheme. For example, the A + ring above composition shows up as the Angstrom symbol.

3. But on the other hand, some combinations of basic characters plus combining marks DO NOT have a single code point that represents them. For example, while your girlfriend might compose dotless lowercase i with combining heart above to produce an i with a heart instead of a dot, there isn't a single codepoint in Unicode for that. (Unless girly-grrls got their own code page. Maybe in Unicode 6...)

4. Since that's a considerable PITA to deal with, we now have NFG format, which really should have been called NFW format, IMO. (W = widechars, natch.) Every combination of basic plus combining marks *that gets used* will have a single grapheme allocated. Many of them, like the Angstrom symbol, or O + combining röckdöts, will already have a real Unicode code point. The rest of them will get negative numbers assigned, one at a time. The negative numbers will only be meaningful to the string they're in, or maybe only to the particular execution context. (There are issues with comparing, etc. Which is why I think maybe one table per execution.)

5. The result is that every grapheme (letter-on-the-page) will have a single number behind it, will have a length of 1, etc. So we can do meaningful substr($str, 2, 7) and get what we expect, even when the fifth grapheme requires a base character plus 4 combining marks.

All hail @Larry!

=Austin

Mark J. Reed wrote:
> Do we really need to be able to map arbitrary graphemes to integers, or is it enough to have an opaque value returned by ord() that, when fed to chr(), returns the same grapheme? If the latter, a list of code points (in one of the official Normalization Forms) would seem to be sufficient.
>
> On 5/18/09, Helmut Wollmersdorfer hel...@wollmersdorfer.at wrote:
>> Darren Duncan wrote:
>>> Since you seem eager, I recommend you start with porting the Parrot PDD 28 to a new Perl 6 Synopsis 15, and continue from there.
>> IMHO we need some people for a broad discussion on the details first.
>> Helmut Wollmersdorfer
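The summary above (one number per grapheme, with negative numbers handed out on demand) can be sketched as a toy table. This is Python rather than Perl 6 or Parrot code, and it is not the PDD 28 implementation: `GraphemeTable` and its methods are hypothetical names, and a "cluster" here is simplified to one character plus the combining marks that follow it, which is much cruder than UAX #29.

```python
import unicodedata

class GraphemeTable:
    """Toy NFG table: one integer per grapheme, negatives for synthetic ones."""

    def __init__(self):
        self._id_of = {}       # NFC cluster string -> negative integer
        self._cluster_of = {}  # negative integer -> NFC cluster string

    @staticmethod
    def _clusters(text):
        # Assumption: a cluster is a character plus any combining marks
        # (nonzero canonical combining class) that directly follow it.
        cluster = ""
        for ch in unicodedata.normalize("NFC", text):
            if cluster and unicodedata.combining(ch):
                cluster += ch
            else:
                if cluster:
                    yield cluster
                cluster = ch
        if cluster:
            yield cluster

    def encode(self, text):
        """Return one integer per grapheme; allocate negatives as needed."""
        codes = []
        for cluster in self._clusters(text):
            if len(cluster) == 1:
                codes.append(ord(cluster))  # a real code point exists
            else:
                if cluster not in self._id_of:
                    new_id = -(len(self._id_of) + 1)
                    self._id_of[cluster] = new_id
                    self._cluster_of[new_id] = cluster
                codes.append(self._id_of[cluster])
        return codes

    def decode(self, codes):
        """Turn a list of integers back into an NFC string."""
        return "".join(chr(c) if c >= 0 else self._cluster_of[c] for c in codes)
```

With one table per process, as suggested in point 4, `encode` gives every string a list whose length is the grapheme count, so indexing and substr become plain list operations. The negative values are only meaningful together with the table that issued them.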
Re: Unicode in 'NFG' formation ?
On Mon, May 18, 2009 at 9:11 AM, Austin Hastings austin_hasti...@yahoo.com wrote: If you haven't read the PDD, it's a good start. snip useful summary I get all that, really. I still question the necessity of mapping each grapheme to a single integer. A single *value*, sure. length($weird_grapheme) should always be 1, absolutely. But why does ord($weird_grapheme) have to be a *numeric* value? If you convert to, say, normalization form C and return a list of the scalar values so obtained, that can be used in any context to reproduce the same grapheme, with no worries about different processes coming up with different assignments of arbitrary negative numbers to graphemes. If you're doing arithmetic with the code points or scalar values of characters, then the specific numbers would seem to matter. I'm looking for the use case where the fact that it's an integer matters but the specific value doesn't. -- Mark J. Reed markjr...@gmail.com
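The alternative proposed above, an ord() that returns the NFC scalar values instead of a per-process integer, might look like the sketch below. It is written in Python for illustration only; `grapheme_ord` and `grapheme_chr` are hypothetical names, not anything specified for Perl 6.

```python
import unicodedata

def grapheme_ord(grapheme):
    """Return the grapheme as a tuple of NFC scalar values."""
    return tuple(ord(ch) for ch in unicodedata.normalize("NFC", grapheme))

def grapheme_chr(value):
    """Rebuild the grapheme from a tuple produced by grapheme_ord."""
    return "".join(chr(cp) for cp in value)
```

The tuple is the same in every process and survives being stored, and Python tuples happen to be hashable and orderable as well. What it gives up is the fixed size that the integer scheme offers.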
Re: Unicode in 'NFG' formation ?
Mark J. Reed wrote: On Mon, May 18, 2009 at 9:11 AM, Austin Hastings austin_hasti...@yahoo.com wrote: If you haven't read the PDD, it's a good start. snip useful summary I get all that, really. I still question the necessity of mapping each grapheme to a single integer. A single *value*, sure. length($weird_grapheme) should always be 1, absolutely. But why does ord($weird_grapheme) have to be a *numeric* value? If you convert to, say, normalization form C and return a list of the scalar values so obtained, that can be used in any context to reproduce the same grapheme, with no worries about different processes coming up with different assignments of arbitrary negative numbers to graphemes. If you're doing arithmetic with the code points or scalar values of characters, then the specific numbers would seem to matter. I'm looking for the use case where the fact that it's an integer matters but the specific value doesn't. There's a couple of cases. First of all, it doesn't have to be an integer. It needs to be a fixed size, and it needs to be orderable, so that we can store a bunch of them in an intelligent fashion, thus making it easy to sort them. With that said, integers meet the need exactly. Plus, there's the benefit that unicode already has an escape hatch built in to it for user-defined stuff. And that escape hatch is an integer. The benefits are documented in the pod: they're fixed size, so we can scan over them forward and backward at low cost. They're easily distinguished (high bit set) so string code can special-case them quickly. They're orderable, comparable, etc. And best of all they contain no trans fat! =Austin
Re: r26868 - docs/Perl6/Spec
On Mon, May 18, 2009 at 07:01:27AM +0200, pugs-comm...@feather.perl6.nl wrote: : Author: jdlugosz : Date: 2009-05-18 07:01:27 +0200 (Mon, 18 May 2009) : New Revision: 26868 : : Modified: :docs/Perl6/Spec/S03-operators.pod : Log: : Fix one typo, s/know/known/. Really just low-hanging fruit to test my Commit access and procedures therein. I'm assuming that the VERSION block is updated manually before checking in, and all versions are numbered sequentially even if a typographic change. It's fine to change the version on a typo, but no big deal if you forget, and sometimes I forget on purpose if it's right after the original checkin that introduced the typo, especially if it's my own typo. :) Larry
Re: is value trait
On Sun, May 17, 2009 at 09:35:50PM +0200, Moritz Lenz wrote: : Hi, : : t/oo/value_types.t mentions the is value trait, which doesn't appear : in the spec anywhere. According to the discussion in [1] there was : speculation about 'is cow' and 'is value', but the former didn't seem to : enter the spec either. : : So what should I do about that test? Simply delete it? Yes, unless someone can think of a reason not to. Larry
Re: Unicode in 'NFG' formation ?
On May 18, 2009, at 09:21 , Mark J. Reed wrote:
> If you're doing arithmetic with the code points or scalar values of characters, then the specific numbers would seem to matter. I'm

I would argue that if you are working with a grapheme cluster (grapheme), arithmetic on individual grapheme values is undefined. What is the meaning of ord(\c[LATIN LETTER T WITH DOT ABOVE, COMBINING DOT BELOW]) + 1? If you say it increments the base character (a reasonable-looking initial stance), what happens if I add an amount which changes the base character to a combining character? And what happens if the original grapheme doesn't have a base character?

In short, I think the only remotely sane result of ord() on a grapheme is an opaque value meaningful to chr() but to very little, if anything, else. If you want to represent it as an integer, fine, but it should be obscured such that math isn't possible on it. Conversely, if you want ord() values you can manipulate, you must work at the codepoint level.

--
brandon s. allbery [solaris,freebsd,perl,pugs,haskell] allb...@kf8nh.com
system administrator [openafs,heimdal,too many hats] allb...@ece.cmu.edu
electrical and computer engineering, carnegie mellon university KF8NH
Re: Unicode in 'NFG' formation ?
On Mon, May 18, 2009 at 12:37:49PM -0400, Brandon S. Allbery KF8NH wrote: On May 18, 2009, at 09:21 , Mark J. Reed wrote: If you're doing arithmetic with the code points or scalar values of characters, then the specific numbers would seem to matter. I'm I would argue that if you are working with a grapheme cluster (grapheme), arithmetic on individual grapheme values is undefined. What is the meaning of ord(\c[LATIN LETTER T WITH DOT ABOVE, COMBINING DOT BELOW]) + 1? If you say it increments the base character (a reasonable-looking initial stance), what happens if I add an amount which changes the base character to a combining character? And what happens if the original grapheme doesn't have a base character? In short, I think the only remotely sane result of ord() on a grapheme is an opaque value meaningful to chr() but to very little, if anything, else. If you want to represent it as an integer, fine, but it should be obscured such that math isn't possible on it. Conversely, if you want ord() values you can manipulate, you must work at the codepoint level. Sure, but this is a weak argument, since you can already write complete ord/chr nonsense at the codepoint level (even in ASCII), and all we're doing here is making graphemes work more like codepoints in terms of storage and indexing. If people abuse it, they have only themselves to blame for relying on what is essentially an implementation detail. The whole point of ord is to cheat, so if they get caught cheating, well, they just have to take their lumps. In the age of Unicode, ord and chr are pretty much irrelevant to most normal text processing anyway except for encoders and decoders, so there's not a great deal of point in labeling the integers as an opaque type, in my opinion. As an implementation detail however, it's important to note that the signed/unsigned distinction gives us a great deal of latitude in how to store a particular sequence of integers. 
Latin-1 will (by definition) fit in a *uint8, while ASCII plus (no more than 128) NFG negatives will fit into *int8. Most European languages will fit into *int16 with up to 32768 synthetic chars. Most Asian text still fits into *uint16 as long as they don't synthesize codepoints. And we can always resort to *uint32 and *int32 knowing that the Unicode consortium isn't going to use the top bit any time in the foreseeable future. (Unless, of course, they endorse something resembling NFG. :)

Note also that uint8 has nothing to do with UTF-8, and uint16 has nothing to do with UTF-16. Surrogate pairs are represented by a single integer in NFG. That is, NFG is always abstract codepoints of some sort without regard to the underlying representation. In that sense it's not important that synthetic codepoints are negative, of course.

Larry
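The storage latitude described above amounts to picking the narrowest integer type that covers a string's smallest and largest values. A small Python sketch of that choice follows; the table `STORAGE_TYPES` and the function `narrowest_storage` are illustrative names, and the preference order (unsigned before signed at each width) is an assumption, not something the message specifies.

```python
# (name, minimum, maximum), narrowest first.
STORAGE_TYPES = [
    ("uint8", 0, 2**8 - 1),
    ("int8", -(2**7), 2**7 - 1),
    ("uint16", 0, 2**16 - 1),
    ("int16", -(2**15), 2**15 - 1),
    ("uint32", 0, 2**32 - 1),
    ("int32", -(2**31), 2**31 - 1),
]

def narrowest_storage(codes):
    """Name the narrowest type that can hold every value in codes."""
    lo = min(codes, default=0)
    hi = max(codes, default=0)
    for name, type_lo, type_hi in STORAGE_TYPES:
        if type_lo <= lo and hi <= type_hi:
            return name
    raise OverflowError("values do not fit in 32 bits")
```

For example, a Latin-1 string maps to uint8, ASCII plus a few synthetic (negative) graphemes maps to int8, and a string that mixes a supplementary-plane character with a synthetic grapheme needs int32.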
Re: Unicode in 'NFG' formation ?
On Mon, May 18, 2009 at 12:37:49PM -0400, Brandon S. Allbery KF8NH wrote:
> I would argue that if you are working with a grapheme cluster (grapheme), arithmetic on individual grapheme values is undefined.

Yup, that was exactly what I was arguing.

> In short, I think the only remotely sane result of ord() on a grapheme is an opaque value meaningful to chr() but to very little, if anything, else.

Which is what we have with the negative integer spec. What I dislike is the transient, handlish nature of those values: like a handle, you can't store the value and then use it to reconstruct the grapheme later. But since actually storing the grapheme itself should be no great feat, I guess that's not much of a hardship.

On Mon, May 18, 2009 at 1:37 PM, Larry Wall la...@wall.org wrote:
> you can already write complete ord/chr nonsense at the codepoint level (even in ASCII)

Sorry, could you clarify what you mean by that?

> And we can always resort to *uint32 and *int32 knowing that the Unicode consortium isn't going to use the top bit any time in the foreseeable future.

s/top bit/top 11 bits/...

> Note also that uint8 has nothing to do with UTF-8, and uint16 has nothing to do with UTF-16. Surrogate pairs are represented by a single integer in NFG.

They are also represented by a single value in UTF-8; that is, the full scalar value is encoded directly, rather than being first encoded into UTF-16 surrogates which are then encoded as UTF-8...

> That is, NFG is always abstract codepoints of some sort

Barely-relevant terminology nit: "abstract code points" sounds like something that would be associated with "abstract characters", which as defined in Unicode are formally distinct from graphemes, which is what we're talking about here. Also, the term "code points" includes the surrogates, which can only appear in UTF-16; I imagine the scalar values we deal with most of the time at the character/grapheme level would be the subset of code points excluding surrogates, which are called "Unicode scalar values".

Surrogates are just weird, since they have assigned code points even though they're purely an encoding mechanism. As such, they straddle the line between abstract characters and an encoding form. I assume that if text comes in as UTF-16, the surrogates will disappear as far as character-level P6 code is concerned. So is there any way for P6 to manipulate surrogates as characters? Maybe an adverb or trait? Or does one have to descend to the bytewise layer for that? (As you said, that *normally* shouldn't be necessary outside encoding and decoding, where you need to do things bytewise anyway; just trying to cover all the bases...)

--
Mark J. Reed markjr...@gmail.com
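The point about surrogates being an encoding mechanism rather than characters can be seen by encoding one supplementary-plane character both ways. The sketch below is Python, where a string is a sequence of code points; `encoded_units` is a hypothetical helper written for this illustration.

```python
def encoded_units(ch):
    """Return (UTF-16 code units, UTF-8 bytes) for a string."""
    utf16 = ch.encode("utf-16-be")
    # Regroup the big-endian bytes into 16-bit code units.
    units = [int.from_bytes(utf16[i:i + 2], "big") for i in range(0, len(utf16), 2)]
    return units, list(ch.encode("utf-8"))

# U+1D11E MUSICAL SYMBOL G CLEF lies outside the Basic Multilingual Plane.
G_CLEF = "\U0001D11E"
```

For `G_CLEF`, the UTF-16 side shows a surrogate pair (two 16-bit units), while the UTF-8 side encodes the scalar value directly in four bytes with no surrogates involved. At the string level there is one character either way, which matches the assumption that surrogates disappear once the text has been decoded.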
Re: Unicode in 'NFG' formation ?
On Mon, May 18, 2009 at 11:11:32AM +0200, Helmut Wollmersdorfer wrote:
: [1] Open questions:
:
: 1) Will graphemes have a unique charname?
:    e.g. GRAPHEME LATIN SMALL LETTER A WITH DOT BELOW AND DOT ABOVE

Yes, presumably that comes with the normalization part of NFG. We're not aiming for round-tripping of synthetic codepoints, just as NFC doesn't do round-tripping of sequences that have precomposed codepoints. We're really just extending the NFC notion a bit further to encompass temporary precomposed codepoints.

: 2) Can I use Unicode property matching safely with graphemes?
:    If yes, who or what maintains the necessary tables?

Good question. My assumption is that adding marks to a character doesn't change its fundamental nature. What needs to be provided other than pass-through to the base character's properties?

: 3) Details of 'life-time', round-trip.

Which is a very interesting topic, with connections to type theory, scope/domain management, and security issues (such as the possibility of a DoS attack on the translation tables).

: 4) Should the definition of graphemes conform to Unicode Standard Annex #29
:    'grapheme clusters'? Which level - legacy, extended or tailored?

No opinion, other than that we're aiming for the most modern formulation that doesn't implicitly cede declarational control to something out of the control of Perl 6 declarations. (See locales for an example of something Perl 6 ignores in the absence of an explicit declaration to pay attention to them.) So just guessing from the names without reading the Annex in question, not legacy, but probably extended, with explicit tailoring allowed by declaration. (Unless extended has some dire performance or policy consequences that would be contraindicative...)

So as long as we stay inside these fundamental Perl 6 design principles, feel free to whack on the specs.

Larry
Re: Unicode in 'NFG' formation ?
On Mon, May 18, 2009 at 02:16:17PM -0400, Mark J. Reed wrote:
: Surrogates are just weird, since they have assigned code points even
: though they're purely an encoding mechanism. As such, they straddle
: the line between abstract characters and an encoding form. I assume
: that if text comes in as UTF-16, the surrogates will disappear as far
: as character-level P6 code is concerned.

I devoutly hope so. UTF-8 is much cleaner than UTF-16 in this regard. (And it's why I qualified my "code point" with "abstract" earlier, to mean the UTF-8 interpretation rather than the UTF-16 interpretation.)

: So is there any way for P6
: to manipulate surrogates as characters? Maybe an adverb or trait?
: Or does one have to descend to the bytewise layer for that? (As you
: said, that *normally* shouldn't be necessary outside encoding and
: decoding, where you need to do things bytewise anyway; just trying to
: cover all the bases...)

Buf16 should work for raw UTF-16 just fine. That's one of the main reasons we have buffers in sizes other than 8, after all.

Larry
Re: each() comprehension
On Sun, May 17, 2009 at 07:41:45PM +0200, Moritz Lenz wrote:
: Hi,
:
: (sorry for yet another p6l email mentioning junctions; if they annoy you
: just ignore this mail :-)
:
: while reviewing some tests I found the each() comprehension in S02
: that evaded my attention so far.
:
: Do we really want to keep such a rather obscure syntactic
: transformation? I find an explicit grep much more readable; if we want
: it to work in a more general case, it might become some kind of junction
: that, on autothreading, keeps a mapping between the original item and
: the new value, and on collapse returns all items for which the new value
: is true. Something along these lines:
:
: g(f(each(1..3)))
: becomes
: g(each(1 => f(1), 2 => f(2), 3 => f(3)))
: becomes
: each(1 => g(f(1)), 2 => g(f(2)), 3 => g(f(3)))
: and on collapse returns
: 1..3.grep:{g(f($_))};
:
: IMHO this would DWIM more in arbitrary code than the special syntactic
: form envisioned

Feel free either to whack it out and/or install each() as a conjectural mapping junction that may be deferred till post-6.0.0.

: Also this part of S02 is rather obscure, IMHO:
:
: In particular,
:
: @result = each(@x) ~~ {...};
:
: is equivalent to
:
: @result = @x.grep:{...};
:
: Should it be @result = @x.grep:{ $_ ~~ ... } instead? Otherwise
:
: 'each(@x) ~~ 1..3' would be transformed into '@x.grep:{1..3}', which
: would return the full list. (Or do adverbial blocks some magic smart
: matching that I'm not aware of?)

The grep itself does the smart matching:

@dogs = grep Dog, @mammals;

Larry
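The mapping junction proposed in the quoted text (remember each original item next to its current value, then keep the originals whose value ends up true) can be sketched in a few lines. This is Python, not Perl 6, and it models only the bookkeeping, not autothreading; the class `Each` and its methods `apply` and `collapse` are hypothetical names.

```python
class Each:
    """Toy 'each' junction: pairs of (original item, current value)."""

    def __init__(self, items):
        self._pairs = [(item, item) for item in items]

    def apply(self, func):
        """Thread func over the current values, keeping the originals."""
        result = Each(())
        result._pairs = [(orig, func(val)) for orig, val in self._pairs]
        return result

    def collapse(self):
        """Return the originals whose current value is true."""
        return [orig for orig, val in self._pairs if val]
```

Calling `Each(range(1, 4)).apply(f).apply(g).collapse()` selects the same items as grepping 1..3 for g(f($_)), which is the equivalence the proposal is after. This sketch happens to preserve input order; the proposal itself does not promise that.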
Re: Unicode in 'NFG' formation ?
On May 18, 2009, at 14:16 , Larry Wall wrote:
> On Mon, May 18, 2009 at 11:11:32AM +0200, Helmut Wollmersdorfer wrote:
>> 3) Details of 'life-time', round-trip.
> Which is a very interesting topic, with connections to type theory, scope/domain management, and security issues (such as the possibility of a DoS attack on the translation tables).

I find myself wondering if they might need to be standardized anyway; specifically I'm contemplating Erlang-style services.

--
brandon s. allbery [solaris,freebsd,perl,pugs,haskell] allb...@kf8nh.com
system administrator [openafs,heimdal,too many hats] allb...@ece.cmu.edu
electrical and computer engineering, carnegie mellon university KF8NH
Re: Unicode in 'NFG' formation ?
Brandon S. Allbery KF8NH wrote:
> On May 18, 2009, at 14:16 , Larry Wall wrote:
>> On Mon, May 18, 2009 at 11:11:32AM +0200, Helmut Wollmersdorfer wrote:
>>> 3) Details of 'life-time', round-trip.
>> Which is a very interesting topic, with connections to type theory, scope/domain management, and security issues (such as the possibility of a DoS attack on the translation tables).
> I find myself wondering if they might need to be standardized anyway; specifically I'm contemplating Erlang-style services.

Why wouldn't a marshalling of an NFG string automatically include the grapheme table? That way you can realize it and immediately use it in fast mode.

Alternatively, if you were providing a persistent string service, a post-marshalling step could re-normalize it in local NFG. The response in NFG could either use the same table you sent (if the response is a subset of the original string) or could attach its own table for translation at your end.

=Austin
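The marshalling idea above (ship the relevant part of the grapheme table with the string, then renumber on arrival) can be sketched as follows. This is a Python illustration under assumptions: the message layout, the function names `marshal` and `unmarshal`, and the dictionary-based tables are all made up here, and no wire format is implied.

```python
def marshal(codes, clusters):
    """Bundle NFG codes with the table entries they actually use.

    clusters maps the sender's negative ids to cluster strings.
    """
    used = {c: clusters[c] for c in codes if c < 0}
    return {"codes": list(codes), "table": used}

def unmarshal(message, local_ids):
    """Renumber a marshalled string against the receiver's table.

    local_ids maps cluster strings to the receiver's negative ids
    and is extended in place when a cluster is new to the receiver.
    """
    out = []
    for c in message["codes"]:
        if c >= 0:
            out.append(c)  # real code points need no translation
        else:
            cluster = message["table"][c]
            if cluster not in local_ids:
                local_ids[cluster] = -(len(local_ids) + 1)
            out.append(local_ids[cluster])
    return out
```

The receiver could also keep the sender's table as-is and skip the renumbering, which is the "fast mode" option mentioned above; renumbering corresponds to the re-normalize-in-local-NFG option.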
Re: Unicode in 'NFG' formation ?
Larry Wall wrote: Which is a very interesting topic, with connections to type theory, scope/domain management, and security issues (such as the possibility of a DoS attack on the translation tables). I think that a DoS attack on Unicode would be called IBM/Windows Code Pages. The rest of the world have been suffering this attack for the last 40 years. I'm not sure anyone would notice, at this point. :-)
r26876 - docs/Perl6/Spec
Author: moritz
Date: 2009-05-18 23:08:54 +0200 (Mon, 18 May 2009)
New Revision: 26876

Modified:
   docs/Perl6/Spec/S02-bits.pod
   docs/Perl6/Spec/S09-data.pod
Log:
[S02] get rid of the each() comprehension
[S09] document speculative each() junction with grep semantics

Modified: docs/Perl6/Spec/S02-bits.pod
===
--- docs/Perl6/Spec/S02-bits.pod	2009-05-18 18:22:24 UTC (rev 26875)
+++ docs/Perl6/Spec/S02-bits.pod	2009-05-18 21:08:54 UTC (rev 26876)
@@ -3564,32 +3564,6 @@
 
 =item *
 
-When evaluating chained operators, if an C<each()> occurs anywhere in that
-chain, the chain will be transformed first into a C<grep>. That is,
-
-    for 0 <= each(@x) < all(@y) {...}
-
-becomes
-
-    for @x.grep:{ 0 <= $_ < all(@y) } {...}
-
-Because of this, the original ordering C<@x> is guaranteed to be
-preserved in the returned list, and duplicate elements in C<@x> are
-preserved as well. In particular,
-
-    @result = each(@x) ~~ {...};
-
-is equivalent to
-
-    @result = @x.grep:{...};
-
-However, this I<each()> comprehension is strictly a syntactic transformation,
-so a list computed any other way will not trigger the rewrite:
-
-    @result = (@x = each(@y)) ~~ {...}; # not a comprehension
-
-=item *
-
 The C<|> prefix operator may be used to force capture context on its
 argument and I<also> defeat any scalar argument checking imposed by
 subroutine signature declarations. Any resulting list arguments are

Modified: docs/Perl6/Spec/S09-data.pod
===
--- docs/Perl6/Spec/S09-data.pod	2009-05-18 18:22:24 UTC (rev 26875)
+++ docs/Perl6/Spec/S09-data.pod	2009-05-18 21:08:54 UTC (rev 26876)
@@ -1057,6 +1057,18 @@
 please limit use of junctions to situations where the eventual binding
 to a scalar formal parameter is clear.
 
+(Conjucture: in post-Perl 6.0.0 we might introduce an C<each()>
+junction which keeps track of its initial values, returning on collapse
+those initial values which transformed into a true value, for example
+
+    each(2, 3, 4) - 3
+
+would return an unordered collection consisting of 2 and 4, because
+C<2-3> and C<4-3> are True in boolean context, while C<3-3> is False.
+However it is not yet clear if we really want that, and if yes, in which
+context the collapse will occur).
+
+
 =head1 Parallelized parameters and autothreading
 
 Within the scope of a C<use autoindex> pragma (or equivalent, such as
Re: Unicode in 'NFG' formation ?
Mark J. Reed markjreed-at-gmail.com |Perl 6| wrote:
> On Mon, May 18, 2009 at 9:11 AM, Austin Hastings austin_hasti...@yahoo.com wrote:
>> If you haven't read the PDD, it's a good start.
> <snip useful summary>
> I get all that, really. I still question the necessity of mapping each grapheme to a single integer. A single *value*, sure. length($weird_grapheme) should always be 1, absolutely. But why does ord($weird_grapheme) have to be a *numeric* value? If you convert to, say, normalization form C and return a list of the scalar values so obtained, that can be used in any context to reproduce the same grapheme, with no worries about different processes coming up with different assignments of arbitrary negative numbers to graphemes.

My feelings, in general. It appears that the concept of mapping total graphemes to integers, negative, etc. is an implementation decision. Perl 6 strings have a concept of graphemes, and functions that work with them. But the core language specification should keep that as general as possible, and allow implementation freedom. The statement that "base moda modb" produces the same grapheme value as "base modb moda" is at the correct level. The statement "the grapheme is an Int" is not only at the wrong level, but not right, as they should be their own distinct type.

I think that the PDD details of assigning negative values as encountered AND the idea of being a list of code points in some normalized form, AND the idea of it being a buffer of bytes in UTF8 with that list of code points encoded therein, are all *allowed* as correct implementations. So is having a type whose instance data stores it in however many forms it wants, and for the Perl end of things you just let the === operator take its natural course.

> If you're doing arithmetic with the code points or scalar values of characters, then the specific numbers would seem to matter. I'm looking for the use case where the fact that it's an integer matters but the specific value doesn't.

Well, you can view a string as bytes of UTF8, code points, or graphemes. If you want numbers you probably wanted the first two. A grapheme object should in some ways behave as a string of 1 grapheme and allow you to obtain bytes of UTF8 or code points, easily.

Now object identity, the address of an object, is not mandated to be an Int or even numeric. Different types can return different things even. The only thing we know is that infix:<===> uses them. Should graphemes be any different? A grapheme object has observed behavior (encode it as...) and internal unobserved behavior. Perhaps we need more assertions such as saying that it can serve as hash keys properly, rather than going all the way to saying that they must be numbered. Especially with an internal numbering system that changes from run to run!

Meanwhile... that's what the Str class does. It still has nothing to do with how source code is parsed. To that extent, mentioning it in S02, at least in that section, is a mistake. A see-also to general Perl Unicode documentation would not be objectionable.

Also, I described more detailed, formal handling of the input stream to the Perl 6 parser last year: http://www.dlugosz.com/Perl6/specdoc.pdf in Section 3.1. It was discussed on this mailing list when I was starting it.

--John
Re: Unicode in 'NFG' formation ?
Larry Wall larry-at-wall.org |Perl 6| wrote: Sure, but this is a weak argument, since you can already write complete ord/chr nonsense at the codepoint level (even in ASCII), and all we're doing here is making graphemes work more like codepoints in terms of storage and indexing. If people abuse it, they have only themselves to blame for relying on what is essentially an implementation detail. The whole point of ord is to cheat, so if they get caught cheating, well, they just have to take their lumps. In the age of Unicode, ord and chr are pretty much irrelevant to most normal text processing anyway except for encoders and decoders, so there's not a great deal of point in labeling the integers as an opaque type, in my opinion. Playing the Devil's Advocate here, some other discussion on this thread made me think of something. People already write code that expects ord's to be ordered. Instead of saying, well, use code points if you want to do that we can encourage people to embrace graphemes and say don't use code points or bytes! Use graphemes! if they behave in a familiar enough manner. So on one hand I say viva la revolution!, graphemes are modeled after the object identity, which is totally opaque except for equality testing. But on the other hand, I want to say they may be funky inside, but you can still _use_ them in the ways you want... and assure that they work as hash keys and are not only ordered but include ASCII ordering as a subgroup. But, still not disallow any good implementation ideas that befit totally different implementations. Of course, that's not a problem unique to graphemes. The object identity keys, for example. Any forward-thinking that replaces old values with magic cookies. Perhaps we need a general class that will assign orderable tags to arbitrary values and remember the mapping, and use that for more general cases. 
It can be explicitly specialized to use any implementation-dependent ordering that actually exists on that type, and the general case would just be to memo-ize an int mapping. --John
Re: Unicode in 'NFG' formation ?
Larry Wall larry-at-wall.org |Perl 6| wrote:
> into *uint16 as long as they don't synthesize codepoints. And we can always resort to *uint32 and *int32 knowing that the Unicode consortium isn't going to use the top bit any time in the foreseeable future. (Unless, of course, they endorse something resembling NFG. :)

No, a few million code points in the Unicode standard can produce an arbitrary number of unique grapheme clusters, since you can apply as many modifiers as you like to each different base character. If you allow multiples, the total is unbounded.

A small program, which ought to go into the test suite <g>, can generate 4G distinct grapheme clusters, one at a time. How many implementations will that break? If they want fixed size, 64 bits should do for now.

Also, if the spec doesn't list a requirement for a minimum implementation limit, *any* fixed-size implementation will be incorrect even if untestable as such.

--John
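The "small program" mentioned above can be approximated with a generator that stacks combining marks on one base character. This is a Python sketch under assumptions: the base letter and the three marks in `MARKS` are arbitrary picks, chosen because they share combining class 230 (so normalization never reorders them) and have no precomposed form with "q"; `distinct_clusters` is a hypothetical name.

```python
import itertools
import unicodedata

# COMBINING GRAVE, ACUTE and CIRCUMFLEX ACCENT: all combining class 230.
MARKS = ("\u0300", "\u0301", "\u0302")

def distinct_clusters(base="q"):
    """Yield an endless stream of canonically distinct grapheme clusters."""
    for length in itertools.count(1):
        for marks in itertools.product(MARKS, repeat=length):
            yield base + "".join(marks)
```

With three marks there are 3**k clusters of k marks each, so a table that assigns a fresh integer to every cluster it meets grows without bound on this input. Reaching 4G distinct clusters takes sequences of about twenty marks at most, which is why any fixed-width scheme needs a stated limit or an overflow policy.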
Re: Unicode in 'NFG' formation ?
On Mon, May 18, 2009 at 07:59:31PM -0500, John M. Dlugosz wrote:
: No, a few million code points in the Unicode standard can produce an
: arbitrary number of unique grapheme clusters, since you can apply as
: many modifiers as you like to each different base character. If you
: allow multiples, the total is unbounded.
:
: A small program, which ought to go into the test suite <g>, can
: generate 4G distinct grapheme clusters, one at a time.

That precise behavior is what I was characterizing as a DoS attack. :)

So in my head it falls into the "Doctor, it hurts when I do this" category.

Larry
Re: Unicode in 'NFG' formation ?
On May 18, 2009, at 21:54 , Larry Wall wrote:
> On Mon, May 18, 2009 at 07:59:31PM -0500, John M. Dlugosz wrote:
>> No, a few million code points in the Unicode standard can produce an arbitrary number of unique grapheme clusters, since you can apply as many modifiers as you like to each different base character. If you allow multiples, the total is unbounded.
>>
>> A small program, which ought to go into the test suite <g>, can generate 4G distinct grapheme clusters, one at a time.
>
> That precise behavior is what I was characterizing as a DoS attack. :)
> So in my head it falls into the "Doctor, it hurts when I do this" category.

If you're working with externally generated Unicode, you may not have that option. I've gotten some bizarre combinations out of Word in Hebrew with nikudot, then saved as UTF8 text (so bizarre, in fact, that in the end I used gedit on FreeBSD).

--
brandon s. allbery [solaris,freebsd,perl,pugs,haskell] allb...@kf8nh.com
system administrator [openafs,heimdal,too many hats] allb...@ece.cmu.edu
electrical and computer engineering, carnegie mellon university KF8NH