Re: r25122 - docs/Perl6/Spec
pugs-comm...@feather.perl6.nl wrote:
>  In the abstract, Perl is written in Unicode, and has consistent Unicode
> -semantics regardless of the underlying text representations.
> +semantics regardless of the underlying text representations.  By default
> +Perl presents Unicode in NFG formation, where each grapheme counts as
> +one character.  A grapheme is what the novice user would think of as a
> +character in their normal everyday life, including any diacritics.

What's with this NFG / Normal Form G that you refer to? I don't see any
mention of that in http://unicode.org/reports/tr15/ ... did you mean NFC?

For that matter, is it possible for all realistic combinations of
diacritics and base letters to be represented by a single Unicode
codepoint, including all language-dependent graphemes?

I thought NFC sort of did one codepoint per grapheme but there were a few
exceptions ... I could be wrong on that point.

-- Darren Duncan
Re: r25122 - docs/Perl6/Spec
On Fri, Jan 30, 2009 at 6:30 AM, Darren Duncan dar...@darrenduncan.net wrote:
> pugs-comm...@feather.perl6.nl wrote:
>> By default Perl presents Unicode in NFG formation, where each grapheme
>> counts as one character.  A grapheme is what the novice user would
>> think of as a character in their normal everyday life, including any
>> diacritics.
>
> What's with this NFG / Normal Form G that you refer to? I don't see any
> mention of that in http://unicode.org/reports/tr15/ ... did you mean NFC?

As far as I can tell, NFG isn't an official Unicode Normalization Form;
it's a HLL thing, and it has nothing to do with code points. When you ask
Perl6 for one character, what you get back (by default) is one grapheme -
presumably as defined by UAX #29 - which may be one or more code points,
and who knows how many bytes it winds up encoded as in memory.
AppleScript 2.0 takes this approach as well.

So are there any non-opaque, non-string grapheme representations? Does
ord() work on them? In AS, the equivalent function is allowed to return a
list of numbers instead of just a single number; in either case, the value
can be passed to the chr() equivalent to get the same grapheme back.

> For that matter, is it possible for all realistic combinations of
> diacritics and base letters to be represented by a single Unicode
> codepoint, including all language-dependent graphemes?

Absolutely not. Again, nobody said anything about code points. We're
talking about Perl6's idea of characters.

--
Mark J. Reed markjr...@gmail.com
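To make the code point versus grapheme distinction concrete, here is a minimal Python sketch. It is not how Perl 6 or Parrot implements anything: the helper names simple_graphemes and grapheme_ord are hypothetical, and the clustering rule (a base code point plus any combining marks that follow it) is only a rough approximation of UAX #29. The grapheme_ord helper mirrors the AppleScript-style behaviour described above, where the ord() equivalent may return a list of numbers.

```python
import unicodedata


def simple_graphemes(text):
    """Group each base code point with the combining marks that follow it.

    Assumption: this is a simplification of UAX #29, good enough for
    Latin letters with diacritics but not for the full standard.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch      # attach the mark to the previous base
        else:
            clusters.append(ch)     # start a new grapheme
    return clusters


def grapheme_ord(cluster):
    """Return the code points that make up one grapheme, as a list."""
    return [ord(ch) for ch in cluster]


def grapheme_chr(numbers):
    """Inverse of grapheme_ord: rebuild the grapheme from its code points."""
    return "".join(chr(n) for n in numbers)


# "e" + combining acute, then "q" + combining acute.
# The first pair has a precomposed form in Unicode; the second does not.
sample = "e\u0301q\u0301"

code_points = len(sample)
nfc_code_points = len(unicodedata.normalize("NFC", sample))
graphemes = simple_graphemes(sample)

print(code_points, nfc_code_points, len(graphemes))
```

Even after NFC normalization the second grapheme still takes two code points, which is the gap between "one code point" and "one character" that the thread is about.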
Re: r25122 - docs/Perl6/Spec
On Fri, Jan 30, 2009 at 03:30:02AM -0800, Darren Duncan wrote:
: pugs-comm...@feather.perl6.nl wrote:
:  In the abstract, Perl is written in Unicode, and has consistent Unicode
: -semantics regardless of the underlying text representations.
: +semantics regardless of the underlying text representations.  By default
: +Perl presents Unicode in NFG formation, where each grapheme counts as
: +one character.  A grapheme is what the novice user would think of as a
: +character in their normal everyday life, including any diacritics.
:
: What's with this NFG / Normal Form G that you refer to? I don't see any
: mention of that in http://unicode.org/reports/tr15/ ... did you mean NFC?

Nope, this is a Perl/Parrot idea.  It started out with a notion of mine
a year ago.  Search for 'grapheme' in

    http://use.perl.org/~chromatic/journal/35461

We named it NFG about the time Simon Cozens wrote a PDD for it for
parrot.  At the moment it's much better specced in Parrotland than in
P6land.  See

    http://www.parrotcode.org/docs/pdd/pdd28_strings.html

NFG stands for Normalization Form G, where the G is short for grapheme.
And before anyone asks, yes, we were aware of the other gloss for NFG
when we picked it.  :)

: For that matter, is it possible for all realistic combinations of
: diacritics and base letters to be represented by a single Unicode
: codepoint, including all language-dependent graphemes?

No, that is the vision of NFC, but there are potentially an infinite
number of graphemes that can be composed in Unicode.  NFG aims to
represent each of those locally as a single integer, and translate back
out to a more standard normalization form on output.

: I thought NFC sort of did one codepoint per grapheme but there were a few
: exceptions ... I could be wrong on that point.

You are correct, NFC doesn't do all that we want.

By the way, we could use someone to write the Perl 6 Unicode synopsis,
based on PDD 28.

Larry
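The idea of representing each grapheme locally as a single integer and translating back out on output can be sketched roughly as follows, in Python. This is an illustration under stated assumptions, not the PDD 28 design: the GraphemeTable class is hypothetical, the choice of negative integers for multi-code-point graphemes is an assumption of this sketch, and the clustering rule (base plus following combining marks) is a simplification of UAX #29.

```python
import unicodedata


def simple_graphemes(text):
    """Approximate grapheme clustering: base code point + combining marks."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


class GraphemeTable:
    """Hypothetical process-local table of synthetic grapheme ids."""

    def __init__(self):
        self._ids = {}        # NFC cluster string -> synthetic (negative) id
        self._clusters = []   # position k holds the cluster for id -(k + 1)

    def encode(self, text):
        """Turn text into one integer per grapheme."""
        out = []
        for cluster in simple_graphemes(unicodedata.normalize("NFC", text)):
            if len(cluster) == 1:
                # A single code point stands for itself.
                out.append(ord(cluster))
            else:
                # No precomposed form: allocate (or reuse) a local id.
                if cluster not in self._ids:
                    self._clusters.append(cluster)
                    self._ids[cluster] = -len(self._clusters)
                out.append(self._ids[cluster])
        return out

    def decode(self, ids):
        """Translate integers back out to an NFC string."""
        return "".join(
            chr(i) if i >= 0 else self._clusters[-i - 1] for i in ids
        )


table = GraphemeTable()
source = "e\u0301q\u0301q\u0301"   # e-acute, then q-acute twice
ids = table.encode(source)
round_trip = table.decode(ids)
print(ids, len(ids))
```

With one integer per grapheme, the length of the integer list is the character count in the sense discussed in this thread. The ids are only meaningful alongside the table that allocated them, which is why output goes back through a standard normalization form.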
Re: r25122 - docs/Perl6/Spec
On Fri, 2009-01-30 at 08:12 +0100, pugs-comm...@feather.perl6.nl wrote:
> @@ -103,7 +106,7 @@
>  =item *
>
>  POD sections may be used reliably as multiline comments in Perl 6.
> -Unlike in Perl 5, POD syntax now requires that C<=begin comment>
> +Unlike in Perl 5, POD syntax now lets you use C<=begin comment>
>  and C<=end comment> delimit a POD block correctly without the need
>  for C<=cut>.  (In fact, C<=cut> is now gone.)  The format name does
>  not have to be C<comment> -- any unrecognized format name will do

I believe that with this change in wording the next line needs to use
'to delimit' rather than just 'delimit'.

-'f
Re: r25122 - docs/Perl6/Spec
On Fri, Jan 30, 2009 at 10:28:43AM -0800, Geoffrey Broadwell wrote:
: On Fri, 2009-01-30 at 08:12 +0100, pugs-comm...@feather.perl6.nl wrote:
: @@ -103,7 +106,7 @@
:  =item *
:
:  POD sections may be used reliably as multiline comments in Perl 6.
: -Unlike in Perl 5, POD syntax now requires that C<=begin comment>
: +Unlike in Perl 5, POD syntax now lets you use C<=begin comment>
:  and C<=end comment> delimit a POD block correctly without the need
:  for C<=cut>.  (In fact, C<=cut> is now gone.)  The format name does
:  not have to be C<comment> -- any unrecognized format name will do
:
: I believe that with this change in wording the next line needs to use
: 'to delimit' rather than just 'delimit'.

You've got a commit bit, I believe.  :)

Larry
Re: r25122 - docs/Perl6/Spec
Larry Wall wrote:
> On Fri, Jan 30, 2009 at 03:30:02AM -0800, Darren Duncan wrote:
>> What's with this NFG / Normal Form G that you refer to? I don't see any
>> mention of that in http://unicode.org/reports/tr15/ ... did you mean NFC?
>
> Nope, this is a Perl/Parrot idea.  It started out with a notion of mine
> a year ago.  Search for 'grapheme' in
>
>     http://use.perl.org/~chromatic/journal/35461
>
> We named it NFG about the time Simon Cozens wrote a PDD for it for
> parrot.  At the moment it's much better specced in Parrotland than in
> P6land.  See
>
>     http://www.parrotcode.org/docs/pdd/pdd28_strings.html

Okay, I understand now. NFG is designed just as a temporary in-process
normal form where the same representation of a character as a number can't
reliably be consistent over the long term, unlike NFC/D/KC/KD/etc.

It does occur to me, though, that as long as we include the generated
lookup table (not required for NFC/etc), NFG can be serialized as is and be
unambiguously understood by NFG-savvy programs over the long term. Much
like LZW (name?) compression, which includes its own lookup table.

So as long as this nature of NFG is understood, and if necessary any
serialized forms include a spec version num / etc as protection in the
face of upgrades, this could also stand to be a standard beyond
Perl/Parrot/etc. I wonder if the Unicode consortium would be interested in
adopting an NFG-alike, or whether that would be beyond their scope?

> By the way, we could use someone to write the Perl 6 Unicode synopsis,
> based on PDD 28.

Well, if someone else doesn't do it first, I don't think it would be too
difficult for me to do this, at least the initial based-on-PDD-28 cut;
however it would likely be a few weeks before I get around to it, partly
since I don't have a Pugs repo checkout in place ... maybe when I port the
new Set::Relation to Perl 6, requiring such a checkout, I may do that too
... but don't wait for me.

By the way, in the meantime, someone should update that reference to NFG in
S02 to include a link to that PDD28, so other people encountering it don't
have to ask the same question I did.

-- Darren Duncan
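The suggestion of serializing NFG together with its generated lookup table and a version number might look something like the following Python sketch. Everything here is hypothetical: the function names dump_nfg and load_nfg, the JSON layout, and the convention that a negative id -(k + 1) refers to entry k of the table are assumptions made for illustration, not anything specified in PDD 28.

```python
import json

FORMAT_VERSION = 1  # hypothetical spec version number


def dump_nfg(ids, table, version=FORMAT_VERSION):
    """Serialize grapheme ids along with the table needed to read them.

    ids:   one integer per grapheme; values >= 0 are plain code points,
           negative values index into the table (assumed convention).
    table: list of multi-code-point grapheme strings.
    """
    return json.dumps({"version": version, "table": table, "ids": ids})


def load_nfg(payload, supported_version=FORMAT_VERSION):
    """Rebuild the text from a payload, refusing unknown versions."""
    doc = json.loads(payload)
    if doc["version"] != supported_version:
        raise ValueError("unsupported NFG payload version: %r" % doc["version"])
    table = doc["table"]
    return "".join(
        chr(i) if i >= 0 else table[-i - 1] for i in doc["ids"]
    )


# e-acute as a plain code point, then q-acute twice via table entry 0.
payload = dump_nfg([0xE9, -1, -1], ["q\u0301"])
text = load_nfg(payload)
print(len(text))
```

Because the payload carries its own table, a reader does not need to share any in-process state with the writer, which is the property being compared to LZW above.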