Re: Unicode in 'NFG' formation ?
Larry Wall wrote:
> On Mon, May 18, 2009 at 11:11:32AM +0200, Helmut Wollmersdorfer wrote:
>> 2) Can I use Unicode property matching safely with graphemes? If yes, who or what maintains the necessary tables?
> Good question. My assumption is that adding marks to a character doesn't change its fundamental nature. What needs to be provided other than pass-through to the base character's properties?

This will work in most cases, but not, for example, with the property ASCII_Hex_Digit:

LATIN SMALL LETTER A is ASCII_Hex_Digit, but
GRAPHEME LATIN SMALL LETTER A WITH DOT BELOW AND DOT ABOVE is not ASCII_Hex_Digit.

I will try to generate some millions of cases based on nfc(nfd($string)) to find out the best inheritance rules.

>> 4) Should the definition of graphemes conform to Unicode Standard Annex #29 'grapheme clusters'? Which level - legacy, extended or tailored?
> No opinion, other than that we're aiming for the most modern formulation that doesn't implicitly cede declarational control to something out of the control of Perl 6 declarations. (See locales for an example of something Perl 6 ignores in the absence of an explicit declaration to pay attention to them.) So just guessing from the names without reading the Annex in question, not legacy, but probably extended, with explicit tailoring allowed by declaration. (Unless extended has some dire performance or policy consequences that would be contraindicative...)

I will look into what ICU supports.

> So as long as we stay inside these fundamental Perl 6 design principles, feel free to whack on the specs.

OK. Hopefully some Indic, Arabic and Asian natives review this.

Helmut Wollmersdorfer
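The pass-through question can be made concrete with a small Python sketch. It is only an illustration: `base_category` and `is_ascii_hex_digit` are hypothetical helper names, and the rule "a property such as ASCII_Hex_Digit holds only for a bare base character" is an assumption, not something the specs say. Python's `unicodedata` module does not expose ASCII_Hex_Digit, so the 22 characters of that property are listed by hand.

```python
import unicodedata

# ASCII_Hex_Digit covers exactly these 22 characters.
ASCII_HEX_DIGITS = set("0123456789abcdefABCDEF")

def base_category(grapheme):
    """General category of the base (first) code point after NFD."""
    return unicodedata.category(unicodedata.normalize("NFD", grapheme)[0])

def is_ascii_hex_digit(grapheme):
    """Assumed rule: the property holds only for a bare base character."""
    decomposed = unicodedata.normalize("NFD", grapheme)
    return len(decomposed) == 1 and decomposed in ASCII_HEX_DIGITS

plain = "a"
dotted = "a\u0323\u0307"  # a + COMBINING DOT BELOW + COMBINING DOT ABOVE

# The general category passes through from the base character ...
print(base_category(plain), base_category(dotted))            # Ll Ll
# ... but ASCII_Hex_Digit does not survive the added marks.
print(is_ascii_hex_digit(plain), is_ascii_hex_digit(dotted))  # True False
```

A real implementation would need such a decision for every property, which is the table-maintenance burden the question is about.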
Re: Unicode in 'NFG' formation ?
Larry Wall larry-at-wall.org |Perl 6| wrote:
> On Mon, May 18, 2009 at 11:11:32AM +0200, Helmut Wollmersdorfer wrote:
>> [1] Open questions:
>> 1) Will graphemes have a unique charname? e.g. GRAPHEME LATIN SMALL LETTER A WITH DOT BELOW AND DOT ABOVE
> Yes, presumably that comes with the normalization part of NFG. We're not aiming for round-tripping of synthetic codepoints, just as NFC doesn't do round-tripping of sequences that have precomposed codepoints. We're really just extending the NFC notion a bit further to encompass temporary precomposed codepoints.

Unique for asking for the name, not when specifying the name. Just as with the code-point order, any combination that means the same should give the same grapheme, just as if you had created the code point sequence first. Perhaps you are not realizing that the different classes of modifiers are independent. You could say DOT ABOVE AND DOT BELOW and get the same thing as DOT BELOW AND DOT ABOVE.

>> 2) Can I use Unicode property matching safely with graphemes? If yes, who or what maintains the necessary tables?
> Good question. My assumption is that adding marks to a character doesn't change its fundamental nature. What needs to be provided other than pass-through to the base character's properties?

Depends on the property! Being a modifier, for example. A detailed look would be needed to decide which properties just pass through to the base char, which are enhanced (e.g. letter becomes letter with modifiers), which don't make sense, which are mostly OK but change sometimes, etc.
Re: Unicode in 'NFG' formation ?
John M. Dlugosz wrote:
> I was going over S02, and found it opens with, "By default Perl presents Unicode in NFG formation, where each grapheme counts as one character." I looked up NFG, and found it to be an invention of this group, but didn't find any details when I tried to chase down the links.

As Darren already wrote, the only definition is in http://docs.parrot.org/parrot/latest/html/docs/pdds/pdd28_strings.pod.html which references 'Unicode Normalization Forms' http://www.unicode.org/reports/tr15/. It also contains the sentence "The Unicode Standard defines a grapheme cluster (commonly simplified to just grapheme)". IMHO the authors meant this document: Unicode Standard Annex #29, Unicode Text Segmentation, http://unicode.org/reports/tr29/

> This opens a whole bunch of questions for me.

I have many unanswered questions [1] about graphemes.

> If you mean that the default for what the individual items in a string are is graphemes, OK, but what does that have to do with parsing source code?

First - nothing. S01: Perl 6 is written in Unicode. Developers can choose one of the encodings (UTF-8, UTF-16 etc.) for files with Perl source code. Characters outside the ASCII range can be used for identifiers, literals, and syntactic punctuation (e.g. 'bracketing pairs'). It's a problem of the parser to handle it correctly.

> Even so, that's not something that would be called a Normalization Form.

Not in Unicode, but it can be called Grapheme Composition. Thus

  \c[LATIN SMALL LETTER A, COMBINING DOT ABOVE, COMBINING DOT BELOW]
  \c[LATIN SMALL LETTER A, COMBINING DOT BELOW, COMBINING DOT ABOVE]
  \c[LATIN SMALL LETTER A WITH DOT ABOVE, COMBINING DOT BELOW]
  \c[LATIN SMALL LETTER A WITH DOT BELOW, COMBINING DOT ABOVE]

should all lead to the same grapheme (my personal assumption).

> Character set encodings and stuff is one of my strengths. I'd like to straighten this out, and can certainly straighten out the wording, but first need to know what you meant by that.

What's specified:
1) A grapheme is 1 character, thus has 'length' 1.
2) A grapheme has a unique internal representation as an integer for some life-time (process), outside the Unicode codepoints.
3) Graphemes can be normalized to NFD, NFC etc.

[1] Open questions:
1) Will graphemes have a unique charname? e.g. GRAPHEME LATIN SMALL LETTER A WITH DOT BELOW AND DOT ABOVE
2) Can I use Unicode property matching safely with graphemes? If yes, who or what maintains the necessary tables?
3) Details of 'life-time', round-trip.
4) Should the definition of graphemes conform to Unicode Standard Annex #29 'grapheme clusters'? Which level - legacy, extended or tailored?

Helmut Wollmersdorfer
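The assumption that the four `\c[...]` spellings denote the same grapheme matches Unicode canonical equivalence, and it can be checked with Python's standard `unicodedata` module. This is a sketch in Python rather than Perl 6; `from_names` is a hypothetical helper, and NFC merely stands in for whatever canonical form an NFG implementation would use internally.

```python
import unicodedata

def from_names(*names):
    """Build a string from Unicode character names."""
    return "".join(unicodedata.lookup(name) for name in names)

spellings = [
    from_names("LATIN SMALL LETTER A", "COMBINING DOT ABOVE", "COMBINING DOT BELOW"),
    from_names("LATIN SMALL LETTER A", "COMBINING DOT BELOW", "COMBINING DOT ABOVE"),
    from_names("LATIN SMALL LETTER A WITH DOT ABOVE", "COMBINING DOT BELOW"),
    from_names("LATIN SMALL LETTER A WITH DOT BELOW", "COMBINING DOT ABOVE"),
]

# Canonically equivalent sequences normalize to one and the same string.
normalized = {unicodedata.normalize("NFC", s) for s in spellings}
print(len(normalized))  # 1

# Even in NFC the result is two code points, since no precomposed
# "a with dot below and dot above" exists; NFG would count it as 1.
print([f"U+{ord(c):04X}" for c in normalized.pop()])  # ['U+1EA1', 'U+0307']
```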
Re: Unicode in 'NFG' formation ?
Darren Duncan wrote:
> Since you seem eager, I recommend you start with porting the Parrot PDD 28 to a new Perl 6 Synopsis 15, and continue from there.

IMHO we need some people for a broad discussion on the details first.

Helmut Wollmersdorfer
Re: Unicode in 'NFG' formation ?
Do we really need to be able to map arbitrary graphemes to integers, or is it enough to have an opaque value returned by ord() that, when fed to chr(), returns the same grapheme? If the latter, a list of code points (in one of the official Normalization Forms) would seem to be sufficient.

On 5/18/09, Helmut Wollmersdorfer hel...@wollmersdorfer.at wrote:
> Darren Duncan wrote:
>> Since you seem eager, I recommend you start with porting the Parrot PDD 28 to a new Perl 6 Synopsis 15, and continue from there.
> IMHO we need some people for a broad discussion on the details first.
> Helmut Wollmersdorfer

--
Mark J. Reed markjr...@gmail.com
Re: Unicode in 'NFG' formation ?
If you haven't read the PDD, it's a good start. To summarize, probably oversimplifying badly:

1. A grapheme is a character *as seen on the page.* That is, if composing a + dot above + dot below produces an a with dots above and below it, then THAT is the grapheme.

2. Unicode has a lot of characters that are single code points representing a complex grapheme. For example, the A + ring above composition shows up as the Angstrom symbol.

3. But on the other hand, some combinations of basic characters plus combining marks DO NOT have a single code point that represents them. For example, while your girlfriend might compose dotless lowercase i with combining heart above to produce an i with a heart instead of a dot, there isn't a single codepoint in Unicode for that. (Unless girly-grrls got their own code page. Maybe in Unicode 6...)

4. Since that's a considerable PITA to deal with, we now have NFG format, which really should have been called NFW format, IMO. (W = widechars, natch.) Every combination of basic plus combining marks *that gets used* will have a single grapheme allocated. Many of them, like the Angstrom symbol, or O + combining röckdöts, will already have a real Unicode code point. The rest of them will get negative numbers assigned, one at a time. The negative numbers will only be meaningful to the string they're in, or maybe only to the particular execution context. (There are issues with comparing, etc. Which is why I think maybe one table per execution.)

5. The result is that every grapheme (letter-on-the-page) will have a single number behind it, will have a length of 1, etc. So we can do meaningful substr($str, 2, 7) and get what we expect, even when the fifth grapheme requires a base character plus 4 combining marks.

All hail @Larry!

=Austin

Mark J. Reed wrote:
> Do we really need to be able to map arbitrary graphemes to integers, or is it enough to have an opaque value returned by ord() that, when fed to chr(), returns the same grapheme? If the latter, a list of code points (in one of the official Normalization Forms) would seem to be sufficient.
>
> On 5/18/09, Helmut Wollmersdorfer hel...@wollmersdorfer.at wrote:
>> Darren Duncan wrote:
>>> Since you seem eager, I recommend you start with porting the Parrot PDD 28 to a new Perl 6 Synopsis 15, and continue from there.
>> IMHO we need some people for a broad discussion on the details first.
>> Helmut Wollmersdorfer
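The scheme summarized in points 4 and 5 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the PDD 28 design: `GraphemeTable` and `split_clusters` are hypothetical names, one table per process is assumed, and cluster boundaries are approximated as "a base character followed by any characters of general category M*", which is much cruder than UAX #29.

```python
import unicodedata

def split_clusters(text):
    """Crude clustering: attach each mark (category M*) to the preceding character."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

class GraphemeTable:
    """Maps multi-code-point clusters to negative integers, allocated on first use."""

    def __init__(self):
        self._ids = {}        # NFC cluster -> negative id
        self._clusters = []   # position (-id - 1) -> NFC cluster

    def encode(self, text):
        codes = []
        for cluster in split_clusters(unicodedata.normalize("NFC", text)):
            if len(cluster) == 1:
                codes.append(ord(cluster))   # a real code point already exists
            else:
                if cluster not in self._ids:
                    self._clusters.append(cluster)
                    self._ids[cluster] = -len(self._clusters)
                codes.append(self._ids[cluster])
        return codes

    def decode(self, codes):
        return "".join(chr(c) if c >= 0 else self._clusters[-c - 1] for c in codes)

table = GraphemeTable()
codes = table.encode("a\u0307\u0323bc")   # a + dot above + dot below, then "bc"
print(codes)                              # [-1, 98, 99]
print(len(codes))                         # 3: every grapheme has length 1
print(table.encode("A\u030a"))            # [197]: composes to the existing U+00C5
```

Because every element is a single integer, slicing `codes` gives the grapheme-level `substr` behaviour described in point 5; the negative ids are only meaningful together with the table that issued them.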
Re: Unicode in 'NFG' formation ?
On Mon, May 18, 2009 at 9:11 AM, Austin Hastings austin_hasti...@yahoo.com wrote:
> If you haven't read the PDD, it's a good start.
> <snip useful summary>

I get all that, really. I still question the necessity of mapping each grapheme to a single integer. A single *value*, sure. length($weird_grapheme) should always be 1, absolutely. But why does ord($weird_grapheme) have to be a *numeric* value? If you convert to, say, normalization form C and return a list of the scalar values so obtained, that can be used in any context to reproduce the same grapheme, with no worries about different processes coming up with different assignments of arbitrary negative numbers to graphemes.

If you're doing arithmetic with the code points or scalar values of characters, then the specific numbers would seem to matter. I'm looking for the use case where the fact that it's an integer matters but the specific value doesn't.

--
Mark J. Reed markjr...@gmail.com
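The alternative proposed in this message, an ord() that returns the NFC scalar values instead of a process-local integer, might look like the following Python sketch. `grapheme_ord` and `grapheme_chr` are hypothetical names chosen for illustration; nothing here is taken from the Perl 6 specs.

```python
import unicodedata

def grapheme_ord(grapheme):
    """Return the grapheme's NFC scalar values as an immutable tuple."""
    return tuple(ord(c) for c in unicodedata.normalize("NFC", grapheme))

def grapheme_chr(scalars):
    """Rebuild the grapheme from a tuple produced by grapheme_ord."""
    return "".join(chr(v) for v in scalars)

value = grapheme_ord("a\u0307\u0323")
print([hex(v) for v in value])   # ['0x1ea1', '0x307']

# The value depends only on the text, so any process can reproduce the grapheme.
print(grapheme_chr(value) == unicodedata.normalize("NFC", "a\u0323\u0307"))  # True
```

The trade-off is that the value is no longer fixed-size, which is the objection raised in the reply that follows.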
Re: Unicode in 'NFG' formation ?
Mark J. Reed wrote:
> On Mon, May 18, 2009 at 9:11 AM, Austin Hastings austin_hasti...@yahoo.com wrote:
>> If you haven't read the PDD, it's a good start.
>> <snip useful summary>
> I get all that, really. I still question the necessity of mapping each grapheme to a single integer. A single *value*, sure. length($weird_grapheme) should always be 1, absolutely. But why does ord($weird_grapheme) have to be a *numeric* value? If you convert to, say, normalization form C and return a list of the scalar values so obtained, that can be used in any context to reproduce the same grapheme, with no worries about different processes coming up with different assignments of arbitrary negative numbers to graphemes.
>
> If you're doing arithmetic with the code points or scalar values of characters, then the specific numbers would seem to matter. I'm looking for the use case where the fact that it's an integer matters but the specific value doesn't.

There's a couple of cases. First of all, it doesn't have to be an integer. It needs to be a fixed size, and it needs to be orderable, so that we can store a bunch of them in an intelligent fashion, thus making it easy to sort them.

With that said, integers meet the need exactly. Plus, there's the benefit that Unicode already has an escape hatch built in to it for user-defined stuff. And that escape hatch is an integer.

The benefits are documented in the pod: they're fixed size, so we can scan over them forward and backward at low cost. They're easily distinguished (high bit set) so string code can special-case them quickly. They're orderable, comparable, etc. And best of all they contain no trans fat!

=Austin
Re: Unicode in 'NFG' formation ?
On May 18, 2009, at 09:21 , Mark J. Reed wrote:
> If you're doing arithmetic with the code points or scalar values of characters, then the specific numbers would seem to matter. I'm

I would argue that if you are working with a grapheme cluster (grapheme), arithmetic on individual grapheme values is undefined. What is the meaning of ord(\c[LATIN LETTER T WITH DOT ABOVE, COMBINING DOT BELOW]) + 1? If you say it increments the base character (a reasonable-looking initial stance), what happens if I add an amount which changes the base character to a combining character? And what happens if the original grapheme doesn't have a base character?

In short, I think the only remotely sane result of ord() on a grapheme is an opaque value meaningful to chr() but to very little, if anything, else. If you want to represent it as an integer, fine, but it should be obscured such that math isn't possible on it. Conversely, if you want ord() values you can manipulate, you must work at the codepoint level.

--
brandon s. allbery [solaris,freebsd,perl,pugs,haskell] allb...@kf8nh.com
system administrator [openafs,heimdal,too many hats] allb...@ece.cmu.edu
electrical and computer engineering, carnegie mellon university    KF8NH
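The hazard of "an amount which changes the base character to a combining character" already exists for plain code points, which a short Python sketch can show. `describe` is a hypothetical helper; the two starting characters were picked only because their successors are instructive.

```python
import unicodedata

def describe(ch):
    """Code point, general category and name of a single character."""
    name = unicodedata.name(ch, "<unnamed>")
    return f"U+{ord(ch):04X} {unicodedata.category(ch)} {name}"

# Adding 1 to a code point says nothing about what kind of character results.
for start in ("Z", "\u02ff"):
    successor = chr(ord(start) + 1)
    print(describe(start), "->", describe(successor))

# 'Z' + 1 is '[' (a letter becomes punctuation), and
# U+02FF + 1 is U+0300 COMBINING GRAVE ACCENT (a spacing character
# becomes a combining mark that attaches to whatever precedes it).
```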
Re: Unicode in 'NFG' formation ?
On Mon, May 18, 2009 at 12:37:49PM -0400, Brandon S. Allbery KF8NH wrote:
> On May 18, 2009, at 09:21 , Mark J. Reed wrote:
>> If you're doing arithmetic with the code points or scalar values of characters, then the specific numbers would seem to matter. I'm
> I would argue that if you are working with a grapheme cluster (grapheme), arithmetic on individual grapheme values is undefined. What is the meaning of ord(\c[LATIN LETTER T WITH DOT ABOVE, COMBINING DOT BELOW]) + 1? If you say it increments the base character (a reasonable-looking initial stance), what happens if I add an amount which changes the base character to a combining character? And what happens if the original grapheme doesn't have a base character?
>
> In short, I think the only remotely sane result of ord() on a grapheme is an opaque value meaningful to chr() but to very little, if anything, else. If you want to represent it as an integer, fine, but it should be obscured such that math isn't possible on it. Conversely, if you want ord() values you can manipulate, you must work at the codepoint level.

Sure, but this is a weak argument, since you can already write complete ord/chr nonsense at the codepoint level (even in ASCII), and all we're doing here is making graphemes work more like codepoints in terms of storage and indexing. If people abuse it, they have only themselves to blame for relying on what is essentially an implementation detail. The whole point of ord is to cheat, so if they get caught cheating, well, they just have to take their lumps. In the age of Unicode, ord and chr are pretty much irrelevant to most normal text processing anyway except for encoders and decoders, so there's not a great deal of point in labeling the integers as an opaque type, in my opinion.

As an implementation detail however, it's important to note that the signed/unsigned distinction gives us a great deal of latitude in how to store a particular sequence of integers. Latin-1 will (by definition) fit in a *uint8, while ASCII plus (no more than 128) NFG negatives will fit into *int8. Most European languages will fit into *int16 with up to 32768 synthetic chars. Most Asian text still fits into *uint16 as long as they don't synthesize codepoints. And we can always resort to *uint32 and *int32 knowing that the Unicode consortium isn't going to use the top bit any time in the foreseeable future. (Unless, of course, they endorse something resembling NFG. :)

Note also that uint8 has nothing to do with UTF-8, and uint16 has nothing to do with UTF-16. Surrogate pairs are represented by a single integer in NFG. That is, NFG is always abstract codepoints of some sort without regard to the underlying representation. In that sense it's not important that synthetic codepoints are negative, of course.

Larry
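The storage latitude described above amounts to picking the narrowest integer width that holds both the largest code point and the most negative synthetic id in a string. A minimal Python sketch, in which `narrowest_storage` and the order of preference in `WIDTHS` are assumptions made for illustration:

```python
# (name, smallest value, largest value), narrowest candidates first.
WIDTHS = [
    ("uint8", 0, 2**8 - 1),
    ("int8", -2**7, 2**7 - 1),
    ("uint16", 0, 2**16 - 1),
    ("int16", -2**15, 2**15 - 1),
    ("uint32", 0, 2**32 - 1),
    ("int32", -2**31, 2**31 - 1),
]

def narrowest_storage(codes):
    """Pick the first element type that can hold every value in codes."""
    lowest = min(codes, default=0)
    highest = max(codes, default=0)
    for name, low, high in WIDTHS:
        if low <= lowest and highest <= high:
            return name
    raise OverflowError("values do not fit in 32 bits")

print(narrowest_storage([72, 233]))       # uint8:  Latin-1 text
print(narrowest_storage([97, -1]))        # int8:   ASCII plus one synthetic
print(narrowest_storage([233, -1]))       # int16:  Latin-1 plus a synthetic
print(narrowest_storage([0x4E2D]))        # uint16: BMP text, no synthetics
print(narrowest_storage([0x1F600, -1]))   # int32:  astral plus a synthetic
```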
Re: Unicode in 'NFG' formation ?
On Mon, May 18, 2009 at 12:37:49PM -0400, Brandon S. Allbery KF8NH wrote:
> I would argue that if you are working with a grapheme cluster (grapheme), arithmetic on individual grapheme values is undefined.

Yup, that was exactly what I was arguing.

> In short, I think the only remotely sane result of ord() on a grapheme is an opaque value meaningful to chr() but to very little, if anything, else.

Which is what we have with the negative integer spec. What I dislike is the transient, handlish nature of those values: like a handle, you can't store the value and then use it to reconstruct the grapheme later. But since actually storing the grapheme itself should be no great feat, I guess that's not much of a hardship.

On Mon, May 18, 2009 at 1:37 PM, Larry Wall la...@wall.org wrote:
> you can already write complete ord/chr nonsense at the codepoint level (even in ASCII)

Sorry, could you clarify what you mean by that?

> And we can always resort to *uint32 and *int32 knowing that the Unicode consortium isn't going to use the top bit any time in the foreseeable future.

s/top bit/top 11 bits/...

> Note also that uint8 has nothing to do with UTF-8, and uint16 has nothing to do with UTF-16. Surrogate pairs are represented by a single integer in NFG.

They are also represented by a single value in UTF-8; that is, the full scalar value is encoded directly, rather than being first encoded into UTF-16 surrogates which are then encoded as UTF-8...

> That is, NFG is always abstract codepoints of some sort

Barely-relevant terminology nit: "abstract code points" sounds like something that would be associated with "abstract characters", which as defined in Unicode are formally distinct from graphemes, which is what we're talking about here. Also, the term "code points" includes the surrogates, which can only appear in UTF-16; I imagine the values we deal with most of the time at the character/grapheme level would be the subset of code points excluding surrogates, which are called "Unicode scalar values".

Surrogates are just weird, since they have assigned code points even though they're purely an encoding mechanism. As such, they straddle the line between abstract characters and an encoding form. I assume that if text comes in as UTF-16, the surrogates will disappear as far as character-level P6 code is concerned. So is there any way for P6 to manipulate surrogates as characters? Maybe an adverb or trait? Or does one have to descend to the bytewise layer for that? (As you said, that *normally* shouldn't be necessary outside encoding and decoding, where you need to do things bytewise anyway; just trying to cover all the bases...)

--
Mark J. Reed markjr...@gmail.com
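The difference between the two encodings, and the "top 11 bits" correction, can be checked directly in Python. The helper names `utf16_code_units` and `is_surrogate` are hypothetical; U+1F600 is used only as a convenient scalar value outside the Basic Multilingual Plane.

```python
def utf16_code_units(text):
    """Return the 16-bit code units of text as integers (big-endian, no BOM)."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]

def is_surrogate(unit):
    """True for code units in the surrogate range U+D800..U+DFFF."""
    return 0xD800 <= unit <= 0xDFFF

ch = "\U0001F600"  # a single scalar value outside the BMP

# UTF-16 needs a surrogate pair for it ...
print([hex(u) for u in utf16_code_units(ch)])   # ['0xd83d', '0xde00']

# ... while UTF-8 encodes the scalar value directly in four bytes.
print(ch.encode("utf-8").hex())                 # f09f9880

# The largest scalar value needs 21 bits, leaving 11 spare in a 32-bit integer.
print((0x10FFFF).bit_length())                  # 21
```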
Re: Unicode in 'NFG' formation ?
On Mon, May 18, 2009 at 11:11:32AM +0200, Helmut Wollmersdorfer wrote:
> [1] Open questions:
> 1) Will graphemes have a unique charname? e.g. GRAPHEME LATIN SMALL LETTER A WITH DOT BELOW AND DOT ABOVE

Yes, presumably that comes with the normalization part of NFG. We're not aiming for round-tripping of synthetic codepoints, just as NFC doesn't do round-tripping of sequences that have precomposed codepoints. We're really just extending the NFC notion a bit further to encompass temporary precomposed codepoints.

> 2) Can I use Unicode property matching safely with graphemes? If yes, who or what maintains the necessary tables?

Good question. My assumption is that adding marks to a character doesn't change its fundamental nature. What needs to be provided other than pass-through to the base character's properties?

> 3) Details of 'life-time', round-trip.

Which is a very interesting topic, with connections to type theory, scope/domain management, and security issues (such as the possibility of a DoS attack on the translation tables).

> 4) Should the definition of graphemes conform to Unicode Standard Annex #29 'grapheme clusters'? Which level - legacy, extended or tailored?

No opinion, other than that we're aiming for the most modern formulation that doesn't implicitly cede declarational control to something out of the control of Perl 6 declarations. (See locales for an example of something Perl 6 ignores in the absence of an explicit declaration to pay attention to them.) So just guessing from the names without reading the Annex in question, not legacy, but probably extended, with explicit tailoring allowed by declaration. (Unless extended has some dire performance or policy consequences that would be contraindicative...)

So as long as we stay inside these fundamental Perl 6 design principles, feel free to whack on the specs.

Larry
Re: Unicode in 'NFG' formation ?
On Mon, May 18, 2009 at 02:16:17PM -0400, Mark J. Reed wrote:
: Surrogates are just weird, since they have assigned code points even though they're purely an encoding mechanism. As such, they straddle the line between abstract characters and an encoding form. I assume that if text comes in as UTF-16, the surrogates will disappear as far as character-level P6 code is concerned.

I devoutly hope so. UTF-8 is much cleaner than UTF-16 in this regard. (And it's why I qualified my "code point" with "abstract" earlier, to mean the UTF-8 interpretation rather than the UTF-16 interpretation.)

: So is there any way for P6 to manipulate surrogates as characters? Maybe an adverb or trait? Or does one have to descend to the bytewise layer for that? (As you said, that *normally* shouldn't be necessary outside encoding and decoding, where you need to do things bytewise anyway; just trying to cover all the bases...)

Buf16 should work for raw UTF-16 just fine. That's one of the main reasons we have buffers in sizes other than 8, after all.

Larry
Re: Unicode in 'NFG' formation ?
On May 18, 2009, at 14:16 , Larry Wall wrote:
> On Mon, May 18, 2009 at 11:11:32AM +0200, Helmut Wollmersdorfer wrote:
>> 3) Details of 'life-time', round-trip.
> Which is a very interesting topic, with connections to type theory, scope/domain management, and security issues (such as the possibility of a DoS attack on the translation tables).

I find myself wondering if they might need to be standardized anyway; specifically I'm contemplating Erlang-style services.

--
brandon s. allbery allb...@kf8nh.com
Re: Unicode in 'NFG' formation ?
Brandon S. Allbery KF8NH wrote:
> On May 18, 2009, at 14:16 , Larry Wall wrote:
>> On Mon, May 18, 2009 at 11:11:32AM +0200, Helmut Wollmersdorfer wrote:
>>> 3) Details of 'life-time', round-trip.
>> Which is a very interesting topic, with connections to type theory, scope/domain management, and security issues (such as the possibility of a DoS attack on the translation tables).
> I find myself wondering if they might need to be standardized anyway; specifically I'm contemplating Erlang-style services.

Why wouldn't a marshalling of an NFG string automatically include the grapheme table? That way you can realize it and immediately use it in fast mode.

Alternatively, if you were providing a persistent string service, a post-marshalling step could re-normalize it in local NFG. The response in NFG could either use the same table you sent (if the response is a subset of the original string) or could attach its own table for translation at your end.

=Austin
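Marshalling a string together with its grapheme table, and re-normalizing it into the receiver's local table, could be sketched as follows in Python. Everything here is an assumption for illustration: the JSON wire format, the names `marshal` and `unmarshal`, and the convention that a table is a dict from negative id to cluster with ids allocated densely as -1, -2, and so on.

```python
import json

def marshal(codes, table):
    """Serialize the codes plus only those table entries the string uses."""
    used = {str(c): table[c] for c in set(codes) if c < 0}
    return json.dumps({"codes": codes, "table": used})

def unmarshal(payload, local_table):
    """Decode a payload, remapping the sender's ids into local_table."""
    data = json.loads(payload)
    remote = data["table"]
    local_ids = {cluster: cid for cid, cluster in local_table.items()}
    codes = []
    for c in data["codes"]:
        if c >= 0:
            codes.append(c)              # real code points travel unchanged
            continue
        cluster = remote[str(c)]
        if cluster not in local_ids:     # allocate the next dense local id
            new_id = -(len(local_table) + 1)
            local_table[new_id] = cluster
            local_ids[cluster] = new_id
        codes.append(local_ids[cluster])
    return codes

sender_table = {-1: "\u1ea1\u0307"}      # a with dot below and dot above
receiver_table = {-1: "q\u0303\u0301"}   # receiver already uses -1 for something else

payload = marshal([-1, 98], sender_table)
print(unmarshal(payload, receiver_table))   # [-2, 98]
print(receiver_table[-2] == "\u1ea1\u0307") # True
```

Sending only the used entries keeps the payload proportional to the string rather than to the sender's whole table.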
Re: Unicode in 'NFG' formation ?
Larry Wall wrote:
> Which is a very interesting topic, with connections to type theory, scope/domain management, and security issues (such as the possibility of a DoS attack on the translation tables).

I think that a DoS attack on Unicode would be called IBM/Windows Code Pages. The rest of the world have been suffering this attack for the last 40 years. I'm not sure anyone would notice, at this point. :-)
Re: Unicode in 'NFG' formation ?
Mark J. Reed markjreed-at-gmail.com |Perl 6| wrote:
> On Mon, May 18, 2009 at 9:11 AM, Austin Hastings austin_hasti...@yahoo.com wrote:
>> If you haven't read the PDD, it's a good start.
>> <snip useful summary>
> I get all that, really. I still question the necessity of mapping each grapheme to a single integer. A single *value*, sure. length($weird_grapheme) should always be 1, absolutely. But why does ord($weird_grapheme) have to be a *numeric* value? If you convert to, say, normalization form C and return a list of the scalar values so obtained, that can be used in any context to reproduce the same grapheme, with no worries about different processes coming up with different assignments of arbitrary negative numbers to graphemes.

My feelings, in general. It appears that the concept of mapping total graphemes to integers, negative, etc. is an implementation decision. Perl 6 strings have a concept of graphemes, and functions that work with them. But the core language specification should keep that as general as possible, and allow implementation freedom.

The statement that "base moda modb" produces the same grapheme value as "base modb moda" is at the correct level. The statement "the grapheme is an Int" is not only at the wrong level, but not right, as they should be their own distinct type.

I think that the PDD details of assigning negative values as encountered AND the idea of being a list of code points in some normalized form, AND the idea of it being a buffer of bytes in UTF8 with that list of code points encoded therein, are all *allowed* as correct implementations. So is having a type whose instance data stores it in however many forms it wants, and for the Perl end of things you just let the === operator take its natural course.

> If you're doing arithmetic with the code points or scalar values of characters, then the specific numbers would seem to matter. I'm looking for the use case where the fact that it's an integer matters but the specific value doesn't.

Well, you can view a string as bytes of UTF8, code points, or graphemes. If you want numbers you probably wanted the first two. A grapheme object should in some ways behave as a string of 1 grapheme and allow you to obtain bytes of UTF8 or code points, easily.

Now object identity, the address of an object, is not mandated to be an Int or even numeric. Different types can return different things even. The only thing we know is that infix:<===> uses them. Should graphemes be any different? A grapheme object has observed behavior (encode it as...) and internal unobserved behavior. Perhaps we need more assertions such as saying that it can serve as hash keys properly, rather than going all the way to saying that they must be numbered. Especially with an internal numbering system that changes from run to run!

Meanwhile... that's what the Str class does. It still has nothing to do with how source code is parsed. To that extent, mentioning it in S02, at least in that section, is a mistake. A see-also to general Perl Unicode documentation would not be objectionable.

Also, I described more detailed, formal handling of the input stream to the Perl 6 parser last year: http://www.dlugosz.com/Perl6/specdoc.pdf in Section 3.1. It was discussed on this mailing list when I was starting it.

--John
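A grapheme as its own distinct type, specified only by observable behavior (equality, hashing, ordering, and views as code points or UTF-8 bytes), might be sketched like this in Python. The class name `Grapheme` and its methods are hypothetical, and storing the NFC string internally is just one of the several implementations this message says should be allowed.

```python
import unicodedata
from functools import total_ordering

@total_ordering
class Grapheme:
    """One grapheme; callers never see how it is numbered internally."""

    def __init__(self, text):
        self._nfc = unicodedata.normalize("NFC", text)

    def __eq__(self, other):
        if not isinstance(other, Grapheme):
            return NotImplemented
        return self._nfc == other._nfc

    def __lt__(self, other):
        if not isinstance(other, Grapheme):
            return NotImplemented
        return self._nfc < other._nfc   # code point order, so ASCII order is kept

    def __hash__(self):
        return hash(self._nfc)

    def codepoints(self, form="NFC"):
        return [ord(c) for c in unicodedata.normalize(form, self._nfc)]

    def utf8(self):
        return self._nfc.encode("utf-8")

g1 = Grapheme("a\u0307\u0323")   # base moda modb
g2 = Grapheme("a\u0323\u0307")   # base modb moda

print(g1 == g2)                              # True: same grapheme value
print({g1: "seen"}[g2])                      # seen: usable as a hash key
print(Grapheme("a") < Grapheme("b"))         # True: ASCII ordering preserved
print([hex(c) for c in g1.codepoints("NFD")])  # ['0x61', '0x323', '0x307']
```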
Re: Unicode in 'NFG' formation ?
Larry Wall larry-at-wall.org |Perl 6| wrote:
> Sure, but this is a weak argument, since you can already write complete ord/chr nonsense at the codepoint level (even in ASCII), and all we're doing here is making graphemes work more like codepoints in terms of storage and indexing. If people abuse it, they have only themselves to blame for relying on what is essentially an implementation detail. The whole point of ord is to cheat, so if they get caught cheating, well, they just have to take their lumps. In the age of Unicode, ord and chr are pretty much irrelevant to most normal text processing anyway except for encoders and decoders, so there's not a great deal of point in labeling the integers as an opaque type, in my opinion.

Playing the Devil's Advocate here, some other discussion on this thread made me think of something.

People already write code that expects ord's to be ordered. Instead of saying, "well, use code points if you want to do that", we can encourage people to embrace graphemes and say "don't use code points or bytes! Use graphemes!" if they behave in a familiar enough manner.

So on one hand I say "viva la revolution!", graphemes are modeled after the object identity, which is totally opaque except for equality testing. But on the other hand, I want to say "they may be funky inside, but you can still _use_ them in the ways you want..." and assure that they work as hash keys and are not only ordered but include ASCII ordering as a subgroup. But, still not disallow any good implementation ideas that befit totally different implementations.

Of course, that's not a problem unique to graphemes. The object identity keys, for example. Any forward-thinking that replaces old values with magic cookies. Perhaps we need a general class that will assign orderable tags to arbitrary values and remember the mapping, and use that for more general cases. It can be explicitly specialized to use any implementation-dependent ordering that actually exists on that type, and the general case would just be to memo-ize an int mapping.

--John
Re: Unicode in 'NFG' formation ?
Larry Wall larry-at-wall.org |Perl 6| wrote:
> into *uint16 as long as they don't synthesize codepoints. And we can always resort to *uint32 and *int32 knowing that the Unicode consortium isn't going to use the top bit any time in the foreseeable future. (Unless, of course, they endorse something resembling NFG. :)

No, a few million code points in the Unicode standard can produce an arbitrary number of unique grapheme clusters, since you can apply as many modifiers as you like to each different base character. If you allow multiples, the total is unbounded.

A small program, which ought to go into the test suite <g>, can generate 4G distinct grapheme clusters, one at a time. How many implementations will that break? If they want fixed size, 64 bits should do for now.

Also, if the spec doesn't list a requirement for a minimum implementation limit, *any* fixed-size implementation will be incorrect even if untestable as such.

--John
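The "small program" mentioned above is easy to sketch in Python. The choice of base character and marks is an assumption: all four marks share canonical combining class 230, so normalization never reorders them and every sequence is a canonically distinct cluster. With 16 such marks, the clusters carrying exactly eight marks alone would number 16**8 = 2**32.

```python
import unicodedata
from itertools import count, islice, product

BASE = "x"
# Grave, acute, circumflex, tilde: all have canonical combining class 230.
MARKS = ["\u0300", "\u0301", "\u0302", "\u0303"]

def distinct_clusters():
    """Yield an unbounded stream of canonically distinct grapheme clusters."""
    for length in count(1):
        for combo in product(MARKS, repeat=length):
            yield BASE + "".join(combo)

sample = list(islice(distinct_clusters(), 1000))
unique = {unicodedata.normalize("NFC", cluster) for cluster in sample}
print(len(sample), len(unique))   # 1000 1000
```

Each new cluster would claim a fresh entry in an NFG table, which is why a table with a fixed id width needs a stated limit or an eviction policy.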
Re: Unicode in 'NFG' formation ?
On Mon, May 18, 2009 at 07:59:31PM -0500, John M. Dlugosz wrote:
> No, a few million code points in the Unicode standard can produce an arbitrary number of unique grapheme clusters, since you can apply as many modifiers as you like to each different base character. If you allow multiples, the total is unbounded.
>
> A small program, which ought to go into the test suite <g>, can generate 4G distinct grapheme clusters, one at a time.

That precise behavior is what I was characterizing as a DoS attack. :)

So in my head it falls into the "Doctor, it hurts when I do this" category.

Larry
Re: Unicode in 'NFG' formation ?
On May 18, 2009, at 21:54 , Larry Wall wrote:
> On Mon, May 18, 2009 at 07:59:31PM -0500, John M. Dlugosz wrote:
>> No, a few million code points in the Unicode standard can produce an arbitrary number of unique grapheme clusters, since you can apply as many modifiers as you like to each different base character. If you allow multiples, the total is unbounded.
>>
>> A small program, which ought to go into the test suite <g>, can generate 4G distinct grapheme clusters, one at a time.
> That precise behavior is what I was characterizing as a DoS attack. :)
>
> So in my head it falls into the "Doctor, it hurts when I do this" category.

If you're working with externally generated Unicode, you may not have that option. I've gotten some bizarre combinations out of Word in Hebrew with nikudot, then saved as UTF8 text (so bizarre, in fact, that in the end I used gedit on FreeBSD).

--
brandon s. allbery allb...@kf8nh.com
Re: Unicode in 'NFG' formation ?
John M. Dlugosz wrote:
> I was going over S02, and found it opens with, "By default Perl presents Unicode in NFG formation, where each grapheme counts as one character." I looked up NFG, and found it to be an invention of this group, but didn't find any details when I tried to chase down the links.
>
> This opens a whole bunch of questions for me. If you mean that the default for what the individual items in a string are is graphemes, OK, but what does that have to do with parsing source code? Even so, that's not something that would be called a Normalization Form.
>
> Character set encodings and stuff is one of my strengths. I'd like to straighten this out, and can certainly straighten out the wording, but first need to know what you meant by that.
>
> Can someone catch me up on the particulars?

I noticed and asked about this a few months ago.

As you say, NFG was invented for Perl 6 and/or Parrot. See http://docs.parrot.org/parrot/latest/html/docs/pdds/pdd28_strings.pod.html for all the formal details that exist to my knowledge.

Back at the time I raised the issue, it was said that we need to take that Parrot PDD 28 and derive the initial Perl 6 Synopsis 15 from it. Such a Synopsis could basically just start out as a clone of the Parrot document. I said that someday I might have the round-tuit for this, but as yet I didn't.

Since you seem eager, I recommend you start with porting the Parrot PDD 28 to a new Perl 6 Synopsis 15, and continue from there.

-- Darren Duncan