Re: unicode
On 17/09/16 13:34, Moritz Lenz wrote:
>> Searching further I found the ucd2c.pl program in the Moarvm tools
>> directory. This generates the unicode_db.c somewhere else in the
>> rakudo tree. I run this program myself on the Unicode 9.0.0
>> database and comparing the generated files shows many differences
>> between the one in the rakudo tree and the generated one.
>
> Please make a rakudo spectest with those changes, and if it passes,
> submit your patch as a pull request.

Unicode support is more than just having the data from the text files in our own unicode database. In Unicode 9, the Zero Width Joiner is now explicitly supported for emoji. If we don't change the algorithm that creates individual graphemes from streams of codepoints to consider this, we'll end up with improper support for 8 (because new stuff is in there) and improper support for 9 (because some stuff is missing) at the same time; I suspect that'll help nobody.

I expect Jnthn will do the full & proper update during the coming month, and running ucd2c.pl is the least time-consuming step of that, but perhaps a pull request for this is still welcome.
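The ZWJ point can be illustrated outside Perl 6. This Python sketch (Python's str is codepoint-indexed, so it exposes the raw codepoint stream) shows that a ZWJ emoji family is several codepoints glued together by U+200D, which a Unicode-9-aware grapheme segmenter should treat as a single grapheme:

```python
import unicodedata

# "Family: woman, woman, girl": three emoji joined by ZERO WIDTH JOINER.
family = "\U0001F469\u200D\U0001F469\u200D\U0001F467"

# At the codepoint level this is five characters...
print(len(family))  # 5
print([unicodedata.name(c) for c in family])

# ...two of which are U+200D ZERO WIDTH JOINER, the signal (per UAX #29
# as amended for Unicode 9 emoji) that the pieces should segment and
# render as ONE grapheme. A segmenter that ignores ZWJ sees 3 graphemes.
print(family.count("\u200D"))  # 2
```

So updating ucd2c.pl's data tables alone would leave the grapheme-building algorithm splitting this into three graphemes instead of one.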
Re: unicode
Hi,

I am looking forward to it.

Thanks,
Marcel

On Sat, Sep 17, 2016 at 01:34:45PM +0200, Moritz Lenz wrote:
> Hi,
>
> On 17.09.2016 13:12, MT wrote:
> > The date found in the file unicode_db.c file is 2012-07-20 which is
> > about Unicode version 6.1.0

So the content in that file is not getting updated when the shipped Unicode version is updated? If so, is there a tool that needs fixing to automate that?

> docs/ChangeLog in MoarVM says
>
> + Updated to Unicode 8
>
> in the section of the 2015.07 release, so it's not that bad :-)

I believe that the plan is to update to Unicode 9 just after this month's release (to give a whole month to iron out any instabilities or bugs). So it might be a little bit bad this month, but next month will be awesome. Allegedly :-)

Nicholas Clark
Re: unicode
On Sat, Sep 17, 2016 at 01:34:45PM +0200, Moritz Lenz wrote:
> Hi,
>
> On 17.09.2016 13:12, MT wrote:
> > The date found in the file unicode_db.c file is 2012-07-20 which is
> > about Unicode version 6.1.0

So the content in that file is not getting updated when the shipped Unicode version is updated? If so, is there a tool that needs fixing to automate that?

> docs/ChangeLog in MoarVM says
>
> + Updated to Unicode 8
>
> in the section of the 2015.07 release, so it's not that bad :-)

I believe that the plan is to update to Unicode 9 just after this month's release (to give a whole month to iron out any instabilities or bugs). So it might be a little bit bad this month, but next month will be awesome. Allegedly :-)

Nicholas Clark
Re: unicode
> > Searching further I found the ucd2c.pl program in the Moarvm tools
> > directory. This generates the unicode_db.c somewhere else in the
> > rakudo tree. I run this program myself on the Unicode 9.0.0 database
> > and comparing the generated files shows many differences between the
> > one in the rakudo tree and the generated one.
>
> Please make a rakudo spectest with those changes, and if it passes,
> submit your patch as a pull request.
>
> > The date found in the file unicode_db.c file is 2012-07-20 which is
> > about Unicode version 6.1.0

How do I proceed from here? Do I pull in the newest rakudo version, make another git branch, then change it, and then push the branch after the tests have run successfully? This way I am not able to cripple the rakudo code. Other people can check the changes too before merging.

> docs/ChangeLog in MoarVM says
>
> + Updated to Unicode 8
>
> in the section of the 2015.07 release, so it's not that bad :-)

I have seen it now; indeed not that old, but it also means the Unicode data changes a lot between versions.

Greets,
Marcel
Re: unicode
Hi,

On 17.09.2016 13:12, MT wrote:
> Searching further I found the ucd2c.pl program in the Moarvm tools
> directory. This generates the unicode_db.c somewhere else in the rakudo
> tree. I run this program myself on the Unicode 9.0.0 database and
> comparing the generated files shows many differences between the one in
> the rakudo tree and the generated one.

Please make a rakudo spectest with those changes, and if it passes, submit your patch as a pull request.

> The date found in the file unicode_db.c file is 2012-07-20 which is
> about Unicode version 6.1.0

docs/ChangeLog in MoarVM says

+ Updated to Unicode 8

in the section of the 2015.07 release, so it's not that bad :-)

Cheers,
Moritz

--
Moritz Lenz
https://deploybook.com/ -- https://perlgeek.de/ -- https://perl6.org/
Re: Unicode Categories
Tom Christiansen wrote:
> Patrick wrote:
> : * Almost. E.g. isL would be nice to have as well.
> :
> : Those exist also:
> :
> : $ ./perl6
> : say 'abCD34' ~~ / <isL> /
> : a
> : say 'abCD34' ~~ / <isN> /
> : 3
>
> They may exist, but I'm not certain it's a good idea to encourage the
> Is_XXX approach on *anything* except Script=XXX properties. They
> certainly don't work on everything, you know. Also, I can't for the
> life of me see why one would ever write isL when Letter is so much more
> obvious; similarly, for isN over Number. Just because you can do so,
> doesn't mean you necessarily should.
>
> http://unicode.org/reports/tr18/#Categories
>
>     The recommended names for UCD properties and property values are in
>     PropertyAliases.txt [Prop] and PropertyValueAliases.txt [PropValue].
>     There are both abbreviated names and longer, more descriptive names.
>     It is strongly recommended that both names be recognized, and that
>     loose matching of property names be used, whereby the case
>     distinctions, whitespace, hyphens, and underbar are ignored.
>
> Furthermore, be aware that the Number property is *NOT* the same as the
> Decimal_Number property. In perl5, if one wants [0-9], then one
> expresses it exactly that way, since that's a lot shorter than writing
> (?=\p{ASCII})\p{Nd}, where Nd can also be Decimal_Number. Again, please
> note that Number is far broader than even Decimal_Number, which is
> itself almost certainly broader than you're thinking.
>
> Here's a trio of little programs specifically designed to help scout
> out Unicode characters and their properties. They work best on 5.12+,
> but should be ok on 5.10, too.
>
> --tom

The 'Is' prefix can be used on any property in 5.12 for which there is no naming conflict. The only naming conflicts are certain of the block properties, such as Arabic: IsArabic means the Arabic script, while InArabic means the base Arabic block. Personally, I find Is and In unintuitive, and prefer to write sc=arabic or blk=arabic instead.
When Unicode proposed to add some properties in 5.2 that started with 'Is', there was significant enough protest that they backed off, and promised never to do it again, adding a stability policy to 6.0 to that effect. Apparently a number of languages use 'Is' as a prefix.
Re: Unicode Categories
> The 'Is' prefix can be used on any property in 5.12 for which there is
> no naming conflict. The only naming conflicts are certain of the block
> properties, such as Arabic. IsArabic means the Arabic script. InArabic
> means the base Arabic block. Personally, I find Is and In unintuitive,
> and prefer to write sc=arabic or blk=arabic instead.

I agree.

> When Unicode proposed to add some properties in 5.2 that started with
> 'Is', there was significant enough protest that they backed off, and
> promised never to do it again, adding a stability policy to 6.0 to that
> effect. Apparently a number of languages use 'Is' as a prefix.

Yes, that's right. Even worse, there are languages that are very very bad about Is vs In, giving the wrong sense to them.

--tom
Re: Unicode Categories
On Wed, Nov 10, 2010 at 01:03:26PM -0500, Chase Albert wrote:
> Sorry if this is the wrong forum. I was wondering if there was a way to
> specify unicode categories
> (http://www.fileformat.info/info/unicode/category/index.htm) in a
> regular expression (and hence a grammar), or if there would be any
> consideration for adding support for that (requiring some kind of
> special syntax).

Unicode categories are done using assertion syntax with "is" followed by the category name. Thus <isLu> (uppercase letter), <isNd> (decimal digit), <isZs> (space separator), etc. This even works in Rakudo today:

    $ ./perl6
    say 'abcdEFG' ~~ / <isLu> /
    E

They can also be combined, as in <+isLu+isLt> (uppercase+titlecase). The relevant section of the spec is in Synopsis 5; search for "Unicode properties are always available with a prefix".

Hope this helps!

Pm
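The Lu/Nd/Zs codes the Rakudo assertions match on are Unicode General Category values, which can be inspected from any language with Unicode tables. A Python sketch of the same lookup (Python here only illustrates the category data, not the Perl 6 regex API):

```python
import unicodedata

# General Category of each character: the same Lu/Nd/Zs codes
# that Rakudo's isLu / isNd / isZs assertions test.
for ch in "abcdEFG":
    print(ch, unicodedata.category(ch))

# First uppercase letter (category Lu), mirroring
#   say 'abcdEFG' ~~ / <isLu> /    which matches 'E'
first_upper = next(c for c in "abcdEFG" if unicodedata.category(c) == "Lu")
print(first_upper)  # E
```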
Re: Unicode Categories
That's exactly what I was looking for*. Awesome, thank you.

~Cheers

* Almost. E.g. isL would be nice to have as well.

On Wed, Nov 10, 2010 at 13:15, Patrick R. Michaud pmich...@pobox.com wrote:
> Unicode properties are always available with a prefix
Re: Unicode Categories
On Wed, Nov 10, 2010 at 01:21:57PM -0500, Chase Albert wrote:
> That's exactly what I was looking for*. Awesome, thank you.
>
> * Almost. E.g. isL would be nice to have as well.

Those exist also:

    $ ./perl6
    say 'abCD34' ~~ / <isL> /
    a
    say 'abCD34' ~~ / <isN> /
    3

Pm
Re: Unicode Categories
Even awesomer, thank you again.

On Wed, Nov 10, 2010 at 13:28, Patrick R. Michaud pmich...@pobox.com wrote:
> On Wed, Nov 10, 2010 at 01:21:57PM -0500, Chase Albert wrote:
> > That's exactly what I was looking for*. Awesome, thank you.
> >
> > * Almost. E.g. isL would be nice to have as well.
>
> Those exist also:
>
>     $ ./perl6
>     say 'abCD34' ~~ / <isL> /
>     a
>     say 'abCD34' ~~ / <isN> /
>     3
>
> Pm
Re: Unicode Categories
Patrick wrote at 12:15pm CST on Wednesday, 10 November 2010:

> > Sorry if this is the wrong forum. I was wondering if there was a way
> > to specify unicode categories
> > (http://www.fileformat.info/info/unicode/category/index.htm) in a
> > regular expression (and hence a grammar), or if there would be any
> > consideration for adding support for that (requiring some kind of
> > special syntax).
>
> Unicode categories are done using assertion syntax with "is" followed
> by the category name. Thus <isLu> (uppercase letter), <isNd> (decimal
> digit), <isZs> (space separator), etc. This even works in Rakudo today:
>
>     $ ./perl6
>     say 'abcdEFG' ~~ / <isLu> /
>     E
>
> They can also be combined, as in <+isLu+isLt> (uppercase+titlecase).
> The relevant section of the spec is in Synopsis 5; search for "Unicode
> properties are always available with a prefix".
>
> Hope this helps!

Actually, that quote from Synopsis raises more questions than it answers. Below I've annotated the three output groups with (letters):

    % uniprops -a A
    U+0041 ‹A› \N{ LATIN CAPITAL LETTER A }:
    (A) \w \pL \p{LC} \p{L_} \p{L} \p{Lu}
    (B) AHex ASCII_Hex_Digit All Any Alnum Alpha Alphabetic ASCII Assigned
        Cased Cased_Letter LC Changes_When_Casefolded CWCF
        Changes_When_Casemapped CWCM Changes_When_Lowercased CWL
        Changes_When_NFKC_Casefolded CWKCF Lu L Gr_Base Grapheme_Base
        Graph GrBase Hex XDigit Hex_Digit ID_Continue IDC ID_Start IDS
        Letter L_ Latin Latn Uppercase_Letter PerlWord PosixAlnum
        PosixAlpha PosixGraph PosixPrint PosixUpper Print Upper Uppercase
        Word XID_Continue XIDC XID_Start XIDS
    (C) Age:1.1 Block=Basic_Latin Bidi_Class:L Bidi_Class=Left_To_Right
        Bidi_Class:Left_To_Right Bc=L Block:ASCII Block:Basic_Latin
        Blk=ASCII Canonical_Combining_Class:0
        Canonical_Combining_Class=Not_Reordered
        Canonical_Combining_Class:Not_Reordered Ccc=NR
        Canonical_Combining_Class:NR Decomposition_Type:None Dt=None
        East_Asian_Width:Na East_Asian_Width=Narrow
        East_Asian_Width:Narrow Ea=Na Grapheme_Cluster_Break:Other GCB=XX
        Grapheme_Cluster_Break:XX Grapheme_Cluster_Break=Other
        Hangul_Syllable_Type:NA Hangul_Syllable_Type=Not_Applicable
        Hangul_Syllable_Type:Not_Applicable Hst=NA
        Joining_Group:No_Joining_Group Jg=NoJoiningGroup
        Joining_Type:Non_Joining Jt=U Joining_Type:U
        Joining_Type=Non_Joining Script=Latin Line_Break:AL
        Line_Break=Alphabetic Line_Break:Alphabetic Lb=AL
        Numeric_Type:None Nt=None Numeric_Value:NaN Nv=NaN Present_In:1.1
        Age=1.1 In=1.1 Present_In:2.0 In=2.0 Present_In:2.1 In=2.1
        Present_In:3.0 In=3.0 Present_In:3.1 In=3.1 Present_In:3.2 In=3.2
        Present_In:4.0 In=4.0 Present_In:4.1 In=4.1 Present_In:5.0 In=5.0
        Present_In:5.1 In=5.1 Present_In:5.2 In=5.2 Script:Latin Sc=Latn
        Script:Latn Sentence_Break:UP Sentence_Break=Upper
        Sentence_Break:Upper SB=UP Word_Break:ALetter WB=LE Word_Break:LE
        Word_Break=ALetter

What that means is that the B properties are properties from the *General* category. They may all be referred to as \p{X} or \p{IsX}, \p{General_Category=X} or \p{General_Category:X}, and \p{GC=X} or \p{GC:X}. I have a feeling that your synopsis quote is referring only to type B properties alone. It is not talking about type C properties, which must also be accounted for.

--tom
Re: Unicode Categories
Patrick wrote:

: * Almost. E.g. isL would be nice to have as well.
:
: Those exist also:
:
: $ ./perl6
: say 'abCD34' ~~ / <isL> /
: a
: say 'abCD34' ~~ / <isN> /
: 3

They may exist, but I'm not certain it's a good idea to encourage the Is_XXX approach on *anything* except Script=XXX properties. They certainly don't work on everything, you know. Also, I can't for the life of me see why one would ever write isL when Letter is so much more obvious; similarly, for isN over Number. Just because you can do so, doesn't mean you necessarily should.

http://unicode.org/reports/tr18/#Categories

    The recommended names for UCD properties and property values are in
    PropertyAliases.txt [Prop] and PropertyValueAliases.txt [PropValue].
    There are both abbreviated names and longer, more descriptive names.
    It is strongly recommended that both names be recognized, and that
    loose matching of property names be used, whereby the case
    distinctions, whitespace, hyphens, and underbar are ignored.

Furthermore, be aware that the Number property is *NOT* the same as the Decimal_Number property. In perl5, if one wants [0-9], then one expresses it exactly that way, since that's a lot shorter than writing (?=\p{ASCII})\p{Nd}, where Nd can also be Decimal_Number. Again, please note that Number is far broader than even Decimal_Number, which is itself almost certainly broader than you're thinking.

Here's a trio of little programs specifically designed to help scout out Unicode characters and their properties. They work best on 5.12+, but should be ok on 5.10, too.

--tom

unitrio.tar.gz
Description: application/tar
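Tom's Number-vs-Decimal_Number distinction is easy to see with any Unicode property database. A Python illustration (Python only supplies the category data here; the \p{...} names are Perl's):

```python
import unicodedata

# U+0663 ARABIC-INDIC DIGIT THREE is Nd (Decimal_Number) but not [0-9]:
print(unicodedata.category("\u0663"))  # Nd
print(int("\u0663"))                   # 3 -- int() accepts any Nd digit

# U+216B ROMAN NUMERAL TWELVE is a Number (category Nl, Letter_Number)
# but NOT a Decimal_Number:
print(unicodedata.category("\u216b"))  # Nl
print(unicodedata.numeric("\u216b"))   # 12.0

# So \p{Number} matches both, \p{Nd} matches only the first,
# and [0-9] matches neither.
```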
Re: Unicode in 'NFG' formation ?
Larry Wall wrote:
> On Mon, May 18, 2009 at 11:11:32AM +0200, Helmut Wollmersdorfer wrote:
> > 2) Can I use Unicode property matching safely with graphemes? If yes,
> > who or what maintains the necessary tables?
>
> Good question. My assumption is that adding marks to a character
> doesn't change its fundamental nature. What needs to be provided other
> than pass-through to the base character's properties?

This will work in most cases, but e.g. not with the property ASCII_Hex_Digit. LATIN SMALL LETTER A is ASCII_Hex_Digit, but GRAPHEME LATIN SMALL LETTER A WITH DOT BELOW AND DOT ABOVE is _not_ ASCII_Hex_Digit.

I will try to generate some millions of cases based on nfc(nfd($string)) to find out the best inheritance rules.

> > 4) Should the definition of graphemes conform to Unicode Standard
> > Annex #29 'grapheme clusters'? Which level - legacy, extended or
> > tailored?
>
> No opinion, other than that we're aiming for the most modern
> formulation that doesn't implicitly cede declarational control to
> something out of the control of Perl 6 declarations. (See locales for
> an example of something Perl 6 ignores in the absence of an explicit
> declaration to pay attention to them.) So just guessing from the names
> without reading the Annex in question, not legacy, but probably
> extended, with explicit tailoring allowed by declaration. (Unless
> extended has some dire performance or policy consequences that would be
> contraindicative...)

Will look into ICU what's supported.

> So as long as we stay inside these fundamental Perl 6 design
> principles, feel free to whack on the specs.

OK. Hopefully some Indic, Arabic and Asian natives review this.

Helmut Wollmersdorfer
Re: Unicode in 'NFG' formation ?
Larry Wall larry-at-wall.org |Perl 6| wrote:
> On Mon, May 18, 2009 at 11:11:32AM +0200, Helmut Wollmersdorfer wrote:
> > [1] Open questions:
> > 1) Will graphemes have an unique charname? e.g.
> > GRAPHEME LATIN SMALL LETTER A WITH DOT BELOW AND DOT ABOVE
>
> Yes, presumably that comes with the normalization part of NFG. We're
> not aiming for round-tripping of synthetic codepoints, just as NFC
> doesn't do round-tripping of sequences that have precomposed
> codepoints. We're really just extending the NFC notion a bit further
> to encompass temporary precomposed codepoints.

Unique for asking for the name, not when specifying the name. Just as with the code-point order, any combination that means the same should give the same grapheme, just as if you had created the code point sequence first. Perhaps you are not realizing that the different classes of modifiers are independent. You could say DOT ABOVE AND DOT BELOW and get the same thing as DOT BELOW AND DOT ABOVE.

> > 2) Can I use Unicode property matching safely with graphemes? If yes,
> > who or what maintains the necessary tables?
>
> Good question. My assumption is that adding marks to a character
> doesn't change its fundamental nature. What needs to be provided other
> than pass-through to the base character's properties?

Depends on the property! Being a modifier, for example. A detailed look would be needed to decide which properties just pass through to the base char, which are enhanced (e.g. letter becomes letter with modifiers), which don't make sense, which are mostly OK but change sometimes, etc.
Re: Unicode in 'NFG' formation ?
John M. Dlugosz wrote:
> I was going over S02, and found it opens with, "By default Perl
> presents Unicode in NFG formation, where each grapheme counts as one
> character." I looked up NFG, and found it to be an invention of this
> group, but didn't find any details when I tried to chase down the
> links.

As Darren already wrote, the only definition is in http://docs.parrot.org/parrot/latest/html/docs/pdds/pdd28_strings.pod.html which references 'Unicode Normalization Forms' http://www.unicode.org/reports/tr15/.

Also there is a reference to "The Unicode Standard defines a grapheme cluster (commonly simplified to just grapheme)". IMHO the authors meant this document: Unicode Standard Annex #29, Unicode Text Segmentation, http://unicode.org/reports/tr29/

> This opens a whole bunch of questions for me.

I have many unanswered questions [1] about graphemes.

> If you mean that the default for what the individual items in a string
> are is graphemes, OK, but what does that have to do with parsing source
> code?

First - nothing. S01: Perl 6 is written in Unicode. Developers can choose one of the encodings (UTF-8, UTF-16 etc.) for files with Perl source code. Characters outside the ASCII range can be used for identifiers, literals, and syntactic punctuation (e.g. 'bracketing pairs'). It's a problem of the parser to handle it correctly.

> Even so, that's not something that would be called a Normalization
> Form.

Not in Unicode, but it can be called Grapheme Composition. Thus

    \c[LATIN SMALL LETTER A, COMBINING DOT ABOVE, COMBINING DOT BELOW]
    \c[LATIN SMALL LETTER A, COMBINING DOT BELOW, COMBINING DOT ABOVE]
    \c[LATIN SMALL LETTER A WITH DOT ABOVE, COMBINING DOT BELOW]
    \c[LATIN SMALL LETTER A WITH DOT BELOW, COMBINING DOT ABOVE]

should all lead to the same grapheme (my personal assumption).

> Character set encodings and stuff is one of my strengths. I'd like to
> straighten this out, and can certainly straighten out the wording, but
> first need to know what you meant by that.

What's specified:
1) A grapheme is 1 character, thus has 'length' 1.
2) A grapheme has a unique internal representation as an integer for some life-time (process), outside the Unicode codepoints.
3) Graphemes can be normalized to NFD, NFC etc.

[1] Open questions:
1) Will graphemes have an unique charname? e.g. GRAPHEME LATIN SMALL LETTER A WITH DOT BELOW AND DOT ABOVE
2) Can I use Unicode property matching safely with graphemes? If yes, who or what maintains the necessary tables?
3) Details of 'life-time', round-trip.
4) Should the definition of graphemes conform to Unicode Standard Annex #29 'grapheme clusters'? Which level - legacy, extended or tailored?

Helmut Wollmersdorfer
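Helmut's assumption that all four \c[...] spellings are the same grapheme is exactly what Unicode canonical equivalence guarantees, and can be checked with a stock normalizer. A Python demonstration (using the precomposed codepoints U+0227 a-with-dot-above and U+1EA1 a-with-dot-below):

```python
import unicodedata

a, above, below = "a", "\u0307", "\u0323"   # COMBINING DOT ABOVE / BELOW
precomp_above = "\u0227"                    # LATIN SMALL LETTER A WITH DOT ABOVE
precomp_below = "\u1ea1"                    # LATIN SMALL LETTER A WITH DOT BELOW

forms = [
    a + above + below,        # a, COMBINING DOT ABOVE, COMBINING DOT BELOW
    a + below + above,        # a, COMBINING DOT BELOW, COMBINING DOT ABOVE
    precomp_above + below,    # a WITH DOT ABOVE, COMBINING DOT BELOW
    precomp_below + above,    # a WITH DOT BELOW, COMBINING DOT ABOVE
]

# All four spellings are canonically equivalent; NFC maps them to one form,
# which is what an NFG implementation would then assign ONE synthetic id to.
nfc = {unicodedata.normalize("NFC", f) for f in forms}
print(len(nfc))   # 1
```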
Re: Unicode in 'NFG' formation ?
Darren Duncan wrote:
> Since you seem eager, I recommend you start with porting the Parrot
> PDD 28 to a new Perl 6 Synopsis 15, and continue from there.

IMHO we need some people for a broad discussion on the details first.

Helmut Wollmersdorfer
Re: Unicode in 'NFG' formation ?
Do we really need to be able to map arbitrary graphemes to integers, or is it enough to have an opaque value returned by ord() that, when fed to chr(), returns the same grapheme? If the latter, a list of code points (in one of the official Normalization Forms) would seem to be sufficient.

On 5/18/09, Helmut Wollmersdorfer hel...@wollmersdorfer.at wrote:
> Darren Duncan wrote:
> > Since you seem eager, I recommend you start with porting the Parrot
> > PDD 28 to a new Perl 6 Synopsis 15, and continue from there.
>
> IMHO we need some people for a broad discussion on the details first.
>
> Helmut Wollmersdorfer

--
Sent from my mobile device

Mark J. Reed markjr...@gmail.com
Re: Unicode in 'NFG' formation ?
On Mon, May 18, 2009 at 9:11 AM, Austin Hastings austin_hasti...@yahoo.com wrote:
> If you haven't read the PDD, it's a good start.

<snip useful summary>

I get all that, really. I still question the necessity of mapping each grapheme to a single integer. A single *value*, sure. length($weird_grapheme) should always be 1, absolutely. But why does ord($weird_grapheme) have to be a *numeric* value?

If you convert to, say, normalization form C and return a list of the scalar values so obtained, that can be used in any context to reproduce the same grapheme, with no worries about different processes coming up with different assignments of arbitrary negative numbers to graphemes.

If you're doing arithmetic with the code points or scalar values of characters, then the specific numbers would seem to matter. I'm looking for the use case where the fact that it's an integer matters but the specific value doesn't.

--
Mark J. Reed markjr...@gmail.com
Re: Unicode in 'NFG' formation ?
Mark J. Reed wrote:
> On Mon, May 18, 2009 at 9:11 AM, Austin Hastings
> austin_hasti...@yahoo.com wrote:
> > If you haven't read the PDD, it's a good start.
>
> <snip useful summary>
>
> I get all that, really. I still question the necessity of mapping each
> grapheme to a single integer. A single *value*, sure.
> length($weird_grapheme) should always be 1, absolutely. But why does
> ord($weird_grapheme) have to be a *numeric* value? If you convert to,
> say, normalization form C and return a list of the scalar values so
> obtained, that can be used in any context to reproduce the same
> grapheme, with no worries about different processes coming up with
> different assignments of arbitrary negative numbers to graphemes. If
> you're doing arithmetic with the code points or scalar values of
> characters, then the specific numbers would seem to matter. I'm
> looking for the use case where the fact that it's an integer matters
> but the specific value doesn't.

There's a couple of cases. First of all, it doesn't have to be an integer. It needs to be a fixed size, and it needs to be orderable, so that we can store a bunch of them in an intelligent fashion, thus making it easy to sort them. With that said, integers meet the need exactly. Plus, there's the benefit that unicode already has an escape hatch built in to it for user-defined stuff. And that escape hatch is an integer.

The benefits are documented in the pod: they're fixed size, so we can scan over them forward and backward at low cost. They're easily distinguished (high bit set) so string code can special-case them quickly. They're orderable, comparable, etc. And best of all they contain no trans fat!

=Austin
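The scheme Austin describes can be sketched in a few lines. This is a hypothetical Python model of the NFG idea from PDD 28, not Parrot's actual implementation; the function names and table layout are mine. Single-codepoint graphemes keep their codepoint, multi-codepoint graphemes get a synthetic negative id from a per-process table:

```python
import unicodedata

_table = {}    # grapheme string -> negative synthetic id
_reverse = []  # index i holds the grapheme for id -(i+1)

def nfg_ord(grapheme: str) -> int:
    """ord() analogue: fixed-size integer per grapheme (hypothetical)."""
    g = unicodedata.normalize("NFC", grapheme)
    if len(g) == 1:
        return ord(g)                      # real codepoint, non-negative
    if g not in _table:                    # allocate a synthetic codepoint;
        _table[g] = -(len(_reverse) + 1)   # negative = "high bit set", so
        _reverse.append(g)                 # string code can special-case it
    return _table[g]

def nfg_chr(n: int) -> str:
    """chr() analogue: valid only within the process that built the table."""
    return chr(n) if n >= 0 else _reverse[-n - 1]

g = "a\u0300\u0323"        # a + grave + dot below: one grapheme, 2 cp after NFC
n = nfg_ord(g)
print(n < 0)                # True: synthetic codepoint
print(nfg_chr(n))           # round-trips to the NFC form of g
print(nfg_ord("A"))         # 65: plain codepoints pass through
```

This also makes Mark's objection concrete: the id is stable within one process (same grapheme, same id) but carries no meaning to another process, exactly the "handlish" property discussed above.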
Re: Unicode in 'NFG' formation ?
On May 18, 2009, at 09:21, Mark J. Reed wrote:
> If you're doing arithmetic with the code points or scalar values of
> characters, then the specific numbers would seem to matter. I'm

I would argue that if you are working with a grapheme cluster (grapheme), arithmetic on individual grapheme values is undefined. What is the meaning of ord(\c[LATIN LETTER T WITH DOT ABOVE, COMBINING DOT BELOW]) + 1? If you say it increments the base character (a reasonable-looking initial stance), what happens if I add an amount which changes the base character to a combining character? And what happens if the original grapheme doesn't have a base character?

In short, I think the only remotely sane result of ord() on a grapheme is an opaque value meaningful to chr() but to very little, if anything, else. If you want to represent it as an integer, fine, but it should be obscured such that math isn't possible on it. Conversely, if you want ord() values you can manipulate, you must work at the codepoint level.

--
brandon s. allbery [solaris,freebsd,perl,pugs,haskell] allb...@kf8nh.com
system administrator [openafs,heimdal,too many hats] allb...@ece.cmu.edu
electrical and computer engineering, carnegie mellon university KF8NH
Re: Unicode in 'NFG' formation ?
On Mon, May 18, 2009 at 12:37:49PM -0400, Brandon S. Allbery KF8NH wrote:
> On May 18, 2009, at 09:21, Mark J. Reed wrote:
> > If you're doing arithmetic with the code points or scalar values of
> > characters, then the specific numbers would seem to matter. I'm
>
> I would argue that if you are working with a grapheme cluster
> (grapheme), arithmetic on individual grapheme values is undefined.
> What is the meaning of ord(\c[LATIN LETTER T WITH DOT ABOVE, COMBINING
> DOT BELOW]) + 1? If you say it increments the base character (a
> reasonable-looking initial stance), what happens if I add an amount
> which changes the base character to a combining character? And what
> happens if the original grapheme doesn't have a base character?
>
> In short, I think the only remotely sane result of ord() on a grapheme
> is an opaque value meaningful to chr() but to very little, if anything,
> else. If you want to represent it as an integer, fine, but it should
> be obscured such that math isn't possible on it. Conversely, if you
> want ord() values you can manipulate, you must work at the codepoint
> level.

Sure, but this is a weak argument, since you can already write complete ord/chr nonsense at the codepoint level (even in ASCII), and all we're doing here is making graphemes work more like codepoints in terms of storage and indexing. If people abuse it, they have only themselves to blame for relying on what is essentially an implementation detail. The whole point of ord is to cheat, so if they get caught cheating, well, they just have to take their lumps. In the age of Unicode, ord and chr are pretty much irrelevant to most normal text processing anyway except for encoders and decoders, so there's not a great deal of point in labeling the integers as an opaque type, in my opinion.

As an implementation detail however, it's important to note that the signed/unsigned distinction gives us a great deal of latitude in how to store a particular sequence of integers. Latin-1 will (by definition) fit in a *uint8, while ASCII plus (no more than 128) NFG negatives will fit into *int8. Most European languages will fit into *int16 with up to 32768 synthetic chars. Most Asian text still fits into *uint16 as long as they don't synthesize codepoints. And we can always resort to *uint32 and *int32 knowing that the Unicode consortium isn't going to use the top bit any time in the foreseeable future. (Unless, of course, they endorse something resembling NFG. :)

Note also that uint8 has nothing to do with UTF-8, and uint16 has nothing to do with UTF-16. Surrogate pairs are represented by a single integer in NFG. That is, NFG is always abstract codepoints of some sort without regard to the underlying representation. In that sense it's not important that synthetic codepoints are negative, of course.

Larry
Re: Unicode in 'NFG' formation ?
On Mon, May 18, 2009 at 12:37:49PM -0400, Brandon S. Allbery KF8NH wrote:
> I would argue that if you are working with a grapheme cluster
> (grapheme), arithmetic on individual grapheme values is undefined.

Yup, that was exactly what I was arguing.

> In short, I think the only remotely sane result of ord() on a grapheme
> is an opaque value meaningful to chr() but to very little, if anything,
> else.

Which is what we have with the negative integer spec. What I dislike is the transient, handlish nature of those values: like a handle, you can't store the value and then use it to reconstruct the grapheme later. But since actually storing the grapheme itself should be no great feat, I guess that's not much of a hardship.

On Mon, May 18, 2009 at 1:37 PM, Larry Wall la...@wall.org wrote:
> you can already write complete ord/chr nonsense at the codepoint level
> (even in ASCII)

Sorry, could you clarify what you mean by that?

> And we can always resort to *uint32 and *int32 knowing that the
> Unicode consortium isn't going to use the top bit any time in the
> foreseeable future.

s/top bit/top 11 bits/...

> Note also that uint8 has nothing to do with UTF-8, and uint16 has
> nothing to do with UTF-16. Surrogate pairs are represented by a single
> integer in NFG.

They are also represented by a single value in UTF-8; that is, the full scalar value is encoded directly, rather than being first encoded into UTF-16 surrogates which are then encoded as UTF-8...

> That is, NFG is always abstract codepoints of some sort

Barely-relevant terminology nit: "abstract code points" sounds like something that would be associated with abstract characters, which as defined in Unicode are formally distinct from graphemes, which is what we're talking about here. Also, the term "code points" includes the surrogates, which can only appear in UTF-16; I imagine the scalar values we deal with most of the time at the character/grapheme level would be the subset of code points excluding surrogates, which are called "Unicode scalar values".

Surrogates are just weird, since they have assigned code points even though they're purely an encoding mechanism. As such, they straddle the line between abstract characters and an encoding form. I assume that if text comes in as UTF-16, the surrogates will disappear as far as character-level P6 code is concerned. So is there any way for P6 to manipulate surrogates as characters? Maybe an adverb or trait? Or does one have to descend to the bytewise layer for that? (As you said, that *normally* shouldn't be necessary outside encoding and decoding, where you need to do things bytewise anyway; just trying to cover all the bases...)

--
Mark J. Reed markjr...@gmail.com
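The surrogate mechanics under discussion are easy to demonstrate concretely. A Python illustration (Python's str, like NFG, works in whole scalar values, so surrogates only appear at the encoding layer):

```python
# U+1F600 GRINNING FACE is one scalar value above the BMP.
s = "\U0001F600"
print(len(s))  # 1 -- a single character at the scalar-value level

# UTF-16 must split it into a surrogate pair (two 16-bit code units)...
print(s.encode("utf-16-be").hex())  # d83dde00 -> U+D83D, U+DE00

# ...while UTF-8 encodes the scalar value directly; no surrogates involved:
print(s.encode("utf-8").hex())  # f09f9880

# The surrogate code points themselves are encoding machinery, not
# characters: encoding a lone surrogate as UTF-8 is an error.
try:
    "\ud83d".encode("utf-8")
except UnicodeEncodeError:
    print("lone surrogate rejected")
```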
Re: Unicode in 'NFG' formation ?
On Mon, May 18, 2009 at 11:11:32AM +0200, Helmut Wollmersdorfer wrote:
> [1] Open questions:
>
> 1) Will graphemes have an unique charname? e.g.
> GRAPHEME LATIN SMALL LETTER A WITH DOT BELOW AND DOT ABOVE

Yes, presumably that comes with the normalization part of NFG. We're not aiming for round-tripping of synthetic codepoints, just as NFC doesn't do round-tripping of sequences that have precomposed codepoints. We're really just extending the NFC notion a bit further to encompass temporary precomposed codepoints.

> 2) Can I use Unicode property matching safely with graphemes? If yes,
> who or what maintains the necessary tables?

Good question. My assumption is that adding marks to a character doesn't change its fundamental nature. What needs to be provided other than pass-through to the base character's properties?

> 3) Details of 'life-time', round-trip.

Which is a very interesting topic, with connections to type theory, scope/domain management, and security issues (such as the possibility of a DoS attack on the translation tables).

> 4) Should the definition of graphemes conform to Unicode Standard
> Annex #29 'grapheme clusters'? Which level - legacy, extended or
> tailored?

No opinion, other than that we're aiming for the most modern formulation that doesn't implicitly cede declarational control to something out of the control of Perl 6 declarations. (See locales for an example of something Perl 6 ignores in the absence of an explicit declaration to pay attention to them.) So just guessing from the names without reading the Annex in question, not legacy, but probably extended, with explicit tailoring allowed by declaration. (Unless extended has some dire performance or policy consequences that would be contraindicative...)

So as long as we stay inside these fundamental Perl 6 design principles, feel free to whack on the specs.

Larry
Re: Unicode in 'NFG' formation ?
On Mon, May 18, 2009 at 02:16:17PM -0400, Mark J. Reed wrote: : Surrogates are just weird, since they have assigned code points even : though they're purely an encoding mechanism. As such, they straddle : the line between abstract characters and an encoding form. I assume : that if text comes in as UTF-16, the surrogates will disappear as far : as character-level P6 code is concerned. I devoutly hope so. UTF-8 is much cleaner than UTF-16 in this regard. (And it's why I qualified my code point with abstract earlier, to mean the UTF-8 interpretation rather than the UTF-16 interpretation.) : So is there any way for P6 : to manipulate surrogates as characters? Maybe an adverb or trait? : Or does one have to descend to the bytewise layer for that? (As you : said, that *normally* shouldn't be necessary outside encoding and : decoding, where you need to do things bytewise anyway; just trying to : cover all the bases...) Buf16 should work for raw UTF-16 just fine. That's one of the main reasons we have buffers in sizes other than 8, after all. Larry
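Larry's Buf16 suggestion can be mimicked in Python with a raw buffer of 16-bit units (a hypothetical stand-in; Buf16 itself is the Perl 6 type): you assemble and inspect the code units, surrogates included, below the character level, and only an explicit decode collapses them.

```python
# Python sketch of working at the Buf16 level: raw 16-bit code units,
# including an explicit surrogate pair, assembled by hand.
import struct

# Surrogate pair for U+1F600 followed by the unit for 'A'
data = struct.pack("<3H", 0xD83D, 0xDE00, 0x0041)

s = data.decode("utf-16-le")
assert s == "\U0001F600A"    # surrogates collapsed into one character
assert len(s) == 2           # two characters, though three 16-bit units
```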
Re: Unicode in 'NFG' formation ?
On May 18, 2009, at 14:16 , Larry Wall wrote: On Mon, May 18, 2009 at 11:11:32AM +0200, Helmut Wollmersdorfer wrote: 3) Details of 'life-time', round-trip. Which is a very interesting topic, with connections to type theory, scope/domain management, and security issues (such as the possibility of a DoS attack on the translation tables). I find myself wondering if they might need to be standardized anyway; specifically I'm contemplating Erlang-style services. -- brandon s. allbery [solaris,freebsd,perl,pugs,haskell] allb...@kf8nh.com system administrator [openafs,heimdal,too many hats] allb...@ece.cmu.edu electrical and computer engineering, carnegie mellon university KF8NH
Re: Unicode in 'NFG' formation ?
Brandon S. Allbery KF8NH wrote: On May 18, 2009, at 14:16 , Larry Wall wrote: On Mon, May 18, 2009 at 11:11:32AM +0200, Helmut Wollmersdorfer wrote: 3) Details of 'life-time', round-trip. Which is a very interesting topic, with connections to type theory, scope/domain management, and security issues (such as the possibility of a DoS attack on the translation tables). I find myself wondering if they might need to be standardized anyway; specifically I'm contemplating Erlang-style services. Why wouldn't a marshalling of an NFG string automatically include the grapheme table? That way you can realize it and immediately use it in fast mode. Alternatively, if you were providing a persistent string service, a post-marshalling step could re-normalize it in local NFG. The response in NFG could either use the same table you sent (if the response is a subset of the original string) or could attach its own table for translation at your end. =Austin
Re: Unicode in 'NFG' formation ?
Larry Wall wrote: Which is a very interesting topic, with connections to type theory, scope/domain management, and security issues (such as the possibility of a DoS attack on the translation tables). I think that a DoS attack on Unicode would be called IBM/Windows Code Pages. The rest of the world has been suffering this attack for the last 40 years. I'm not sure anyone would notice, at this point. :-)
Re: Unicode in 'NFG' formation ?
Mark J. Reed markjreed-at-gmail.com |Perl 6| wrote: On Mon, May 18, 2009 at 9:11 AM, Austin Hastings austin_hasti...@yahoo.com wrote: If you haven't read the PDD, it's a good start. <snip useful summary> I get all that, really. I still question the necessity of mapping each grapheme to a single integer. A single *value*, sure. length($weird_grapheme) should always be 1, absolutely. But why does ord($weird_grapheme) have to be a *numeric* value? If you convert to, say, normalization form C and return a list of the scalar values so obtained, that can be used in any context to reproduce the same grapheme, with no worries about different processes coming up with different assignments of arbitrary negative numbers to graphemes. My feelings, in general. It appears that the concept of mapping total graphemes to integers, negative, etc. is an implementation decision. Perl 6 strings have a concept of graphemes, and functions that work with them. But the core language specification should keep that as general as possible, and allow implementation freedom. The statement that "base + moda + modb" produces the same grapheme value as "base + modb + moda" is at the correct level. The statement "the grapheme is an Int" is not only at the wrong level, but not right, as they should be their own distinct type. I think that the PDD details of assigning negative values as encountered AND the idea of being a list of code points in some normalized form, AND the idea of it being a buffer of bytes in UTF8 with that list of code points encoded therein, are all *allowed* as correct implementations. So is having a type whose instance data stores it in however many forms it wants, and for the Perl end of things you just let the === operator take its natural course. If you're doing arithmetic with the code points or scalar values of characters, then the specific numbers would seem to matter. I'm looking for the use case where the fact that it's an integer matters but the specific value doesn't.
Well, you can view a string as bytes of UTF8, code points, or graphemes. If you want numbers you probably wanted the first two. A grapheme object should in some ways behave as a string of 1 grapheme and allow you to obtain bytes of UTF8 or code points, easily. Now object identity, the address of an object, is not mandated to be an Int or even numeric. Different types can return different things even. The only thing we know is that infix:<===> uses them. Should graphemes be any different? A grapheme object has observed behavior (encode it as...) and internal unobserved behavior. Perhaps we need more assertions such as saying that it can serve as hash keys properly, rather than going all the way to saying that they must be numbered. Especially with an internal numbering system that changes from run to run! Meanwhile... that's what the Str class does. It still has nothing to do with how source code is parsed. To that extent, mentioning it in S02, at least in that section, is a mistake. A see-also to general Perl Unicode documentation would not be objectionable. Also, I described more detailed, formal handling of the input stream to the Perl 6 parser last year: http://www.dlugosz.com/Perl6/specdoc.pdf in Section 3.1. It was discussed on this mailing list when I was starting it. --John
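Mark's alternative to a synthetic integer per grapheme, the NFC sequence of scalar values, is easy to demonstrate. Here is a Python sketch (unicodedata is Python's property database, not anything Perl 6 would use), built on the "a with dot below and dot above" example from earlier in the thread:

```python
# Python sketch: represent a grapheme by its NFC scalar values, which any
# process recomputes identically (no per-process negative-number table).
import unicodedata

g = "a\u0323\u0307"                    # a + combining dot below + dot above
nfc = unicodedata.normalize("NFC", g)

scalars = [ord(c) for c in nfc]
# NFC composes a + dot below into U+1EA1; the dot above stays combining.
assert scalars == [0x1EA1, 0x0307]

# Rebuilding from the scalar list reproduces exactly the same grapheme.
assert "".join(map(chr, scalars)) == nfc
```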
Re: Unicode in 'NFG' formation ?
Larry Wall larry-at-wall.org |Perl 6| wrote: Sure, but this is a weak argument, since you can already write complete ord/chr nonsense at the codepoint level (even in ASCII), and all we're doing here is making graphemes work more like codepoints in terms of storage and indexing. If people abuse it, they have only themselves to blame for relying on what is essentially an implementation detail. The whole point of ord is to cheat, so if they get caught cheating, well, they just have to take their lumps. In the age of Unicode, ord and chr are pretty much irrelevant to most normal text processing anyway except for encoders and decoders, so there's not a great deal of point in labeling the integers as an opaque type, in my opinion. Playing the Devil's Advocate here, some other discussion on this thread made me think of something. People already write code that expects ord's to be ordered. Instead of saying, well, use code points if you want to do that we can encourage people to embrace graphemes and say don't use code points or bytes! Use graphemes! if they behave in a familiar enough manner. So on one hand I say viva la revolution!, graphemes are modeled after the object identity, which is totally opaque except for equality testing. But on the other hand, I want to say they may be funky inside, but you can still _use_ them in the ways you want... and assure that they work as hash keys and are not only ordered but include ASCII ordering as a subgroup. But, still not disallow any good implementation ideas that befit totally different implementations. Of course, that's not a problem unique to graphemes. The object identity keys, for example. Any forward-thinking that replaces old values with magic cookies. Perhaps we need a general class that will assign orderable tags to arbitrary values and remember the mapping, and use that for more general cases. 
It can be explicitly specialized to use any implementation-dependent ordering that actually exists on that type, and the general case would just be to memo-ize an int mapping. --John
Re: Unicode in 'NFG' formation ?
Larry Wall larry-at-wall.org |Perl 6| wrote: into *uint16 as long as they don't synthesize codepoints. And we can always resort to *uint32 and *int32 knowing that the Unicode consortium isn't going to use the top bit any time in the foreseeable future. (Unless, of course, they endorse something resembling NFG. :) No, a few million code points in the Unicode standard can produce an arbitrary number of unique grapheme clusters, since you can apply as many modifiers as you like to each different base character. If you allow multiples, the total is unbounded. A small program, which ought to go into the test suite <g>, can generate 4G distinct grapheme clusters, one at a time. How many implementations will that break? If they want fixed size, 64 bits should do for now. Also, if the spec doesn't list a requirement for a minimum implementation limit, *any* fixed-size implementation will be incorrect even if untestable as such. --John
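John's unboundedness point is easy to verify on a small scale. A Python sketch (pure illustration, not the NFG algorithm itself):

```python
# Python sketch: a handful of code points yields combinatorially many
# distinct grapheme clusters once combining marks can stack.
from itertools import product

# All four marks have canonical combining class 230, so different orders
# are canonically distinct and are not reshuffled by normalization.
MARKS = ["\u0300", "\u0301", "\u0302", "\u0308"]  # grave, acute, circumflex, diaeresis

clusters = {"a" + "".join(seq) for seq in product(MARKS, repeat=3)}
assert len(clusters) == 4 ** 3    # 64 clusters from just 5 code points
```

Raising `repeat` makes the count grow exponentially, which is exactly why a fixed-size synthetic-codepoint table can always be exhausted.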
Re: Unicode in 'NFG' formation ?
On Mon, May 18, 2009 at 07:59:31PM -0500, John M. Dlugosz wrote: No, a few million code points in the Unicode standard can produce an arbitrary number of unique grapheme clusters, since you can apply as many modifiers as you like to each different base character. If you allow multiples, the total is unbounded. A small program, which ought to go into the test suite <g>, can generate 4G distinct grapheme clusters, one at a time. That precise behavior is what I was characterizing as a DoS attack. :) So in my head it falls into the Doctor-it-hurts-when-I-do-this category. Larry
Re: Unicode in 'NFG' formation ?
On May 18, 2009, at 21:54 , Larry Wall wrote: On Mon, May 18, 2009 at 07:59:31PM -0500, John M. Dlugosz wrote: No, a few million code points in the Unicode standard can produce an arbitrary number of unique grapheme clusters, since you can apply as many modifiers as you like to each different base character. If you allow multiples, the total is unbounded. A small program, which ought to go into the test suite <g>, can generate 4G distinct grapheme clusters, one at a time. That precise behavior is what I was characterizing as a DoS attack. :) So in my head it falls into the Doctor-it-hurts-when-I-do-this category. If you're working with externally generated Unicode, you may not have that option. I've gotten some bizarre combinations out of Word in Hebrew with nikudot, then saved as UTF8 text (so bizarre, in fact, that in the end I used gedit on FreeBSD). -- brandon s. allbery [solaris,freebsd,perl,pugs,haskell] allb...@kf8nh.com system administrator [openafs,heimdal,too many hats] allb...@ece.cmu.edu electrical and computer engineering, carnegie mellon university KF8NH
Re: Unicode in 'NFG' formation ?
John M. Dlugosz wrote: I was going over S02, and found it opens with, By default Perl presents Unicode in NFG formation, where each grapheme counts as one character. I looked up NFG, and found it to be an invention of this group, but didn't find any details when I tried to chase down the links. This opens a whole bunch of questions for me. If you mean that the default for what the individual items in a string are is graphemes, OK, but what does that have to do with parsing source code? Even so, that's not something that would be called a Normalization Form. Character set encodings and stuff is one of my strengths. I'd like to straighten this out, and can certainly straighten out the wording, but first need to know what you meant by that. Can someone catch me up on the particulars? I noticed and asked about this a few months ago. As you say, NFG was invented for Perl 6 and/or Parrot. See http://docs.parrot.org/parrot/latest/html/docs/pdds/pdd28_strings.pod.html for all the formal details that exist to my knowledge. Back at the time I raised the issue, it was said that we need to take that Parrot PDD 28 and derive the initial Perl 6 Synopsis 15 from it. Such a Synopsis could basically just start out as a clone of the Parrot document. I said that someday I might have the round-tuit for this, but as yet I didn't. Since you seem eager, I recommend you start with porting the Parrot PDD 28 to a new Perl 6 Synopsis 15, and continue from there. -- Darren Duncan
Re: Unicode bracketing spec question
On Thu, 23 Apr 2009, Helmut Wollmersdorfer wrote: Timothy S. Nelson wrote: I note that S02 says that the unicode classes Ps/Pe are blessed to act as opening and closing quotes. Is there a reason that we can't have Pi/Pf blessed too? I ask because there are quotation marks in the Pi/Pf set that are called Substitution and Transposition which I thought might be cool quotes for s/// and tr/// :). You mean 2E00 - 2E2F Supplemental Punctuation New Testament editorial symbols [...] 2E02 LEFT SUBSTITUTION BRACKET 2E03 RIGHT SUBSTITUTION BRACKET [...] 2E09 LEFT TRANSPOSITION BRACKET 2E0A RIGHT TRANSPOSITION BRACKET That sounds like them. But if you really want to use these characters, your source will be hard to read without exotic fonts. You have been warned;-) My fonts don't show them either. But we could call it job protection ;). :) - | Name: Tim Nelson | Because the Creator is,| | E-mail: wayl...@wayland.id.au| I am | - BEGIN GEEK CODE BLOCK Version 3.12 GCS d+++ s+: a- C++$ U+++$ P+++$ L+++ E- W+ N+ w--- V- PE(+) Y+++ PGP-+++ R(+) !tv b++ DI D G+ e++ h! y- -END GEEK CODE BLOCK-
Re: Unicode bracketing spec question
Timothy S. Nelson wrote: I note that S02 says that the unicode classes Ps/Pe are blessed to act as opening and closing quotes. Is there a reason that we can't have Pi/Pf blessed too? I ask because there are quotation marks in the Pi/Pf set that are called Substitution and Transposition which I thought might be cool quotes for s/// and tr/// :). You mean 2E00 - 2E2F Supplemental Punctuation New Testament editorial symbols [...] 2E02 LEFT SUBSTITUTION BRACKET 2E03 RIGHT SUBSTITUTION BRACKET [...] 2E09 LEFT TRANSPOSITION BRACKET 2E0A RIGHT TRANSPOSITION BRACKET Cool idea. But if you really want to use these characters, your source will be hard to read without exotic fonts. You have been warned;-) Helmut Wollmersdorfer
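The general categories under discussion can be checked directly. A Python sketch using unicodedata (Python's copy of the UCD, not Perl 6's tables):

```python
# Python sketch: the substitution brackets are Pi/Pf, not Ps/Pe, which is
# why the thread asks for Pi/Pf to be blessed as quoting pairs too.
import unicodedata

assert unicodedata.category("\u2E02") == "Pi"  # LEFT SUBSTITUTION BRACKET
assert unicodedata.category("\u2E03") == "Pf"  # RIGHT SUBSTITUTION BRACKET
assert unicodedata.category("(") == "Ps"       # ordinary open bracket
assert unicodedata.category(")") == "Pe"       # ordinary close bracket
```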
[perl #61394] Re: unicode and macosx
# New Ticket Created by Stephane Payrard # Please include the string: [perl #61394] # in the subject line of all future correspondence about this issue. # URL: http://rt.perl.org/rt3/Ticket/Display.html?id=61394 my $s = ; say $s.chars # now returns 1 Note : the bug was reported on macintel 32 bits which died. I am now testing on a macintel 64 bits. I don't know if it can affect the test. On Mon, May 19, 2008 at 6:28 PM, Stéphane Payrard cognomi...@gmail.com wrote: On a macintel 10.5 I have some problem with unicode. unicode characters are not recognized as such. See the rakudo test below The configuring phase gives : Determining whether ICU is installed...yes. The compiling phase finishes with an error but it apparently causes no problems except I can't run 'make test' because of the dependence on a successful compilation. ar: blib/lib/libparrot.a is a fat file (use libtool(1) or lipo(1) and ar(1) on it) ar: blib/lib/libparrot.a: Inappropriate file type or format make: *** [blib/lib/libparrot.a] Error 1 rakudo is generated without problem But the following test fails. I pasted the content of the literal string with a character that emacs says to be #x8a0 my $s = ; say $s.chars # $s == \x8a0 2 I expected one. -- cognominal stef
[Patch] Re: Unicode Operators cheatsheet, please!
Rob Kinyon wrote: xOn 5/31/05, Sam Vilain [EMAIL PROTECTED] wrote: Rob Kinyon wrote: I would love to see a document (one per editor) that describes the Unicode characters in use and how to make them. The Set implementation in Pugs uses (at last count) 20 different Unicode characters as operators. I have updated the unicode quickref, and started a Perlmonks discussion node for this to be explored - see http://www.perlmonks.org/index.pl?node_id=462246 As I replied on Perlmonks, it would be more helpful if the Compose keys were listed and not just the ASCII versions. Plus, a quick primer on how to enable Unicode in your favorite editor. I don't know about Emacs, but the Vim documentation on multibyte is difficult to work with, at best. Well, :help digraph isn't particularly bad, though the included table only covers latin-1. The canonical source is RFC1345. But I've attached a patch for the set symbols that have them. Thanks, Rob Index: docs/quickref/unicode === --- docs/quickref/unicode (revision 4305) +++ docs/quickref/unicode (working copy) @@ -21,6 +21,10 @@ Note that the compose combinations here are an X11R6 standard, and do not necessarily correspond to the compose combinations available when you use your compose key. + +The digraphs used in vim come from Character Mnemonics Character Sets, +RFC1345 (http://www.ietf.org/rfc/rfc1345.txt). After doing :set digraph, +the digraph ^k A B may also be entered as A BS B. Unicode ASCIIkey sequence charfallbackVimEmacs Unix Compose Key combination @@ -30,22 +34,22 @@ ¥ Y ^k Y e C-x 8 Y Compose Y = Set.pm operators (included for reference): -≠ != -∩ * -∪ + +≠ != ^k ! 
= +∩ * ^k ( U +∪ + ^k ) U ∖ - -⊂ -⊃ -⊆ = -⊇ = -⊄ !( $a $b ) +⊂ ^k ( C +⊃ ^k ) C +⊆ = ^k ( _ +⊇ = ^k ) _ +⊄ !( $a $b ) ⊅ !( $a $b ) ⊈ !( $a = $b ) ⊉ !( $a = $b ) -⊊ +⊊ ⊋ -∋/∍ $a.includes($b) -∈/∊ $b.includes($a) +∋/∍ $a.includes($b) ^k ) - +∈/∊ $b.includes($a) ^k ( - ∌!$a.includes($b) ∉!$b.includes($a) @@ -58,20 +62,20 @@ So, these *might* be considered not too awful; -× * -¬ ! +× * ^k * X +¬ ! ^k N O ∕ / ≡ =:= ≔ := ⩴ or ≝ ::= - ≈ or ≊~~ + ≈ or ≊~~ ^k ? 2 … ... -√ sqrt() -∧ -∨ || +√ sqrt() ^k R T +∧ ^k A N +∨ || ^k O R ∣ mod (? bit of a stretch, perhaps) - ⌈$x⌉ceil($x) - ⌊$x⌋floor($x) + ⌈$x⌉ceil($x) ^k / 7 + ⌊$x⌋floor($x)^k 7 / 7 However I think it is a BAD idea that the following unicode characters
Re: Unicode Operators cheatsheet, please!
On 5/31/05, Sam Vilain [EMAIL PROTECTED] wrote: Rob Kinyon wrote: I would love to see a document (one per editor) that describes the Unicode characters in use and how to make them. The Set implementation in Pugs uses (at last count) 20 different Unicode characters as operators. I have updated the unicode quickref, and started a Perlmonks discussion node for this to be explored - see http://www.perlmonks.org/index.pl?node_id=462246 As I replied on Perlmonks, it would be more helpful if the Compose keys were listed and not just the ASCII versions. Plus, a quick primer on how to enable Unicode in your favorite editor. I don't know about Emacs, but the Vim documentation on multibyte is difficult to work with, at best. Thanks, Rob
Re: Unicode Operators cheatsheet, please!
Rob Kinyon wrote: I would love to see a document (one per editor) that describes the Unicode characters in use and how to make them. The Set implementation in Pugs uses (at last count) 20 different Unicode characters as operators. I have updated the unicode quickref, and started a Perlmonks discussion node for this to be explored - see http://www.perlmonks.org/index.pl?node_id=462246 Sam.
Re: Unicode Operators cheatsheet, please!
On Fri, May 27, 2005 at 10:29:39AM -0400, Rob Kinyon wrote: I would love to see a document (one per editor) that describes the Unicode characters in use and how to make them. The Set implementation in Pugs uses (at last count) 20 different Unicode characters as operators. Good idea. A modest start is at docs/quickref/unicode . -- Gaal Yahas [EMAIL PROTECTED] http://gaal.livejournal.com/
Re: Unicode Support - ICU Optional
On Wed, Aug 04, 2004 at 04:10:56AM -0700, Joshua Gatcomb wrote: WRT improving the ease of use of ICU. My suggestion is that a representative from each platform that Parrot is currently being built on download the latest stable version of ICU source, build it, and note anything special they needed to do to get it working. Those things should make putting a newer version into CVS a realistic possibility. I am volunteering for Cygwin (yeah I know - big surprise there). OK. Solaris, Sun C compilers. Notionally a supported platform. /usr/include/sys/feature_tests.h, line 277: #error: Compiler or options invalid for pre-UNIX 03 X/Open applications and pre-2001 POSIX applications cc: acomp failed for putil.c make[1]: *** [putil.d] Error 2 make[1]: Leaving directory `/export/home/nick/Ponie/ponie-clean/icu/source/common' make: *** [all-recursive] Error 2 It's this one again. Solaris 10 seems too new for it. OK, Solaris 10 is in beta but this is the same pain as before. I should report this to the ICU people. 
Note also that it will only build as is on this list of platforms for 3.0 The following names can be supplied as the argument for platform: AIX4.3xlC Use IBM's xlC on AIX 4.3 AIX4.3xlC_nothreads Use IBM's xlC on AIX 4.3 with no multithreading AIX4.3VAUse IBM's Visual Age xlC_r compiler on AIX 4.3 AIXGCC Use GCC on AIX ALPHA/LINUXGCC Use GCC on Alpha/Linux systems ALPHA/LINUXCCC Use Compaq C compiler on Alpha/Linux systems BeOSUse the GNU C++ compiler on BeOS Cygwin Use the GNU C++ compiler on Cygwin Cygwin/MSVC Use the Microsoft Visual C++ compiler on Cygwin FreeBSD Use the GNU C++ compiler on Free BSD HP-UX11ACC Use the Advanced C++ compiler on HP-UX 11 HP-UX11CC Use HP's C++ compiler on HP-UX 11 LinuxRedHat Use the GNU C++ compiler on Linux LINUX/ECC Use the Intel ECC compiler on Linux LINUX/ICC Use the Intel ICC compiler on Linux MacOSX Use the GNU C++ compiler on MacOS X (Darwin) QNX Use QNX's QCC compiler on QNX/Neutrino SOLARISCC Use Sun's CC compiler on Solaris SOLARISCC/W4.2 Use Sun's Workshop 4.2 CC compiler on Solaris SOLARISGCC Use the GNU C++ compiler on Solaris SOLARISX86 Use Sun's CC compiler on Solaris x86 TRU64V5.1/CXX Use Compaq's cxx compiler on Tru64 (OSF) zOS Use IBM's cxx compiler on z/OS (os/390) zOSV1R2 Use IBM's cxx compiler for z/OS 1.2 OS390V2R10 Use IBM's cxx compiler for OS/390 2.10 ie I'm being forced to build LP64 on Solaris. I will now try to scrape enough disk space on a friends Debian Sparc box to see what LinuxRedHat works like there. I won't have time to try Linux on ARM until Friday (at least) and I no longer have access to any other architectures running Debian. It may make sense to work with ICU initially, and it does support some of the more esoteric platforms that perl5 does (QNX, BeOS, EBCDIC mainframes) but I don't even see Irix in their list of supported platforms, let alone some of our more fun friends such as Unicos and NEC SuperUnix (or whatever the pain is called. 
Nice hardware; evil Unix) Heck, even OpenBSD isn't there. We would have to work with them quite a lot to bring ICU to the level of portability of Perl 5. Nicholas Clark
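For reference, the platform names in Nicholas's list are arguments to ICU's runConfigureICU wrapper script. A hedged sketch of the build steps these reports are exercising (the platform name and paths are examples, assuming an unpacked ICU 3.0 source tree):

```shell
# Sketch: building ICU from source with an explicit platform selection.
cd icu/source
./runConfigureICU SOLARISCC     # pick the entry matching your toolchain
gmake                           # GNU make is required by ICU's makefiles
gmake check                     # run ICU's own test suite before installing
```

The Solaris 10 workaround reported later in this thread amounts to overriding the compiler invocation at the make stage instead (make CC=cc\ -D_XPG6).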
Re: Unicode Support - ICU Optional
On Thu, Aug 05, 2004 at 10:51:46AM +0100, Nicholas Clark wrote: On Wed, Aug 04, 2004 at 04:10:56AM -0700, Joshua Gatcomb wrote: WRT improving the ease of use of ICU. My suggestion is that a representative from each platform that Parrot is currently being built on download the latest stable version of ICU source, build it, and note anything special they needed to do to get it working. Those things should make putting a newer version into CVS a realistic possibility. I am volunteering for Cygwin (yeah I know - big surprise there). OK. Solaris, Sun C compilers. Notionally a supported platform. OK. AIX, gcc. Notionally a supported platform. $ /usr/bin/gmake rm -rf config/icu-config /opt/freeware/bin/install -c -m 644 ./config/icu-config-top config/icu-config sed -f ./config/make2sh.sed ./config/Makefile.inc | grep -v '#M#' | uniq config/icu-config sed -f ./config/make2sh.sed ./config/mh-aix-gcc | grep -v '#M#' | uniq config/icu-config cat ./config/icu-config-bottom config/icu-config echo # Rebuilt on `date` config/icu-config /bin/sh ./mkinstalldirs lib mkdir lib /bin/sh ./mkinstalldirs bin mkdir bin /usr/bin/gmake[0]: Making `all' in `stubdata' gmake[1]: Entering directory `/data_vx/nick-sandpit/build/icu/source/stubdata' generating dependency information for stubdata.c gmake[1]: Leaving directory `/data_vx/nick-sandpit/build/icu/source/stubdata' gmake[1]: Entering directory `/data_vx/nick-sandpit/build/icu/source/stubdata' /opt/freeware/GNUPro/bin/gcc -I../common -I../common -DHAVE_CONFIG_H -O3 -c -o stubdata.o stubdata.c rm -f libicudata30.0.a ; /opt/freeware/GNUPro/bin/gcc -O3 -Wl,-bbigtoc -shared -Wl,-bexpall -o libicudata30.0.a stubdata.o /opt/freeware/GNUPro/lib/gcc-lib/powerpc-ibm-aix5.1.0.0/2.9-aix51-020209/real-ld: target expall not found collect2: ld returned 1 exit status gmake[1]: *** [libicudata30.0.a] Error 1 gmake[1]: Leaving directory `/data_vx/nick-sandpit/build/icu/source/stubdata' gmake: *** [all-recursive] Error 2 This looks terminal. 
OTOH I know how to work around Solaris 10, so I'll report on that when it's finished. If I sound rude about this, it's because I know how portable the people who came before me managed to make Perl5, and I try to keep it that way. Nicholas Clark
Re: Unicode Support - ICU Optional
On Thu, Aug 05, 2004 at 10:51:46AM +0100, Nicholas Clark wrote: It's this one again. Solaris 10 seems too new for it. OK, Solaris 10 is in beta but this is the same pain as before. I should report this to the ICU people. Reported as bug #4047 ICU 3 will build, pass all tests and install if make is invoked as make CC=cc\ -D_XPG6 Libs are 32 bit. Heck, even OpenBSD isn't there. Or VMS. How could I miss VMS? Nicholas Clark
Re: Unicode Support - ICU Optional
On Wed, Aug 04, 2004 at 04:10:56AM -0700, Joshua Gatcomb wrote: WRT improving the ease of use of ICU. My suggestion is that a representative from each platform that Parrot is currently being built on download the latest stable version of ICU source, build it, and note anything special they needed to do to get it working. Those things should make putting a newer Builds OK on FreeBSD 5.2, but make check goes boom: /ucmptst/ ---[OK] ---/ucmptst/TestUCMP8API /tsformat/ /tsformat/ccaltst/ Segmentation fault (core dumped) gmake[2]: *** [check-local] Error 139 gmake[2]: Leaving directory `/home/nick/build/icu/source/test/cintltst' gmake[1]: *** [check-recursive] Error 2 gmake[1]: Leaving directory `/home/nick/build/icu/source/test' gmake: *** [check-recursive] Error 2 I've not looked into why yet. Nicholas Clark
Re: Unicode Support - ICU Optional
On Wed, Aug 04, 2004 at 04:10:56AM -0700, Joshua Gatcomb wrote: WRT improving the ease of use of ICU. My suggestion is that a representative from each platform that Parrot is currently being built on download the latest stable version of ICU source, build it, and note x86 Debian builds and tests just fine when ICU is configured with the platform LinuxRedHat The Sparc Debian box is down, so I can't see if that's LinuxRedHat too. Nicholas Clark
Re: Unicode Support - ICU Optional
On Thu, 5 Aug 2004, Nicholas Clark wrote: On Wed, Aug 04, 2004 at 04:10:56AM -0700, Joshua Gatcomb wrote: WRT improving the ease of use of ICU. My suggestion is that a representative from each platform that Parrot is currently being built on download the latest stable version of ICU source, build it, and note x86 Debian builds and tests just fine when ICU is configured with the platform LinuxRedHat The Sparc Debian box is down, so I can't see if that's LinuxRedHat too. I just successfully built icu-3.0 on a Debian/UltraSPARC system. The LinuxRedHat bit doesn't actually do anything beyond what a plain run of ./configure would do. I'm not sure quite what to think about ICU at the moment. I certainly agree ICU is complex and when it goes awry, it looks quite daunting to fix. Part of the issue is certainly that ICU is trying to do some hard things: 1. It builds 8 different shared libraries along the way. Presumably, as part of the build/test/install/use cycle, it needs to use those libraries, and not other versions of the icu libraries. As we know from dealing with shared libperl.so libraries, this is hard to do, and requires platform-specific information that's often impossible to guess. Similarly, it has to find data files, with a way to determine at run time where to find them. 2. It generates some stuff on-the-fly. Doing so portably (while correctly propagating all the various environment variables to get the right shared libraries and build tools) is, again, hard. 3. Correct implementation of some Unicode stuff is hard. Any competing system would likely have to address many of the same issues. I haven't dug into the build system deeply enough to have a sense of how much work would be involved in making it more portable. Some of the complexity is probably necessary because the problem itself is complex, but some of it has probably just evolved that way. perl5's Configure system certainly has a lot of both types of complexity. 
Finally, I note that ICU is still rapidly evolving. The 2.6 version in parrot right now that is obsolete is only 6 months old. That's both good and bad. It's good in that fixes could quickly flow both ways between the parrot and icu developers. It's bad, though, if the parrot version ends up forking significantly from the standard one, because it will be a lot of work to keep parrot's version in sync. So I don't know what to do with it at the moment; any alternative looks like a lot of work. -- Andy Dougherty [EMAIL PROTECTED]
Re: Unicode Support - ICU Optional
At 4:10 AM -0700 8/4/04, Joshua Gatcomb wrote: All: After speaking with Dan in #parrot last night, I either had originally misunderstood his position or he has changed it (paraphrased): We will ship Parrot with unicode support, but:. A. The unicode support does not necessarily need to be limited to a single library or ICU specifically. B. Just because CVS will have unicode support, does not mean the user will be forced to use it. C. Configure should detect a system unicode library and do the right thing in choosing which one it uses. Yup, you've got it. I *thought* that having ICU in would be more a win than a loss. Given the hell this has been putting people through I'm seriously changing my mind. We have a single requirement -- Parrot, as shipped, *must* have a working Unicode solution. It won't have to be configured when parrot's built, but it must at least be configurable. Right now, that solution's ICU. Longer term, well... longer term I dunno. So, here's the plan. 1) We beat up Configure to probe for and use the system ICU, if available. (Switches are needed now, it should be automagic) 2) I spec out the encoding and charset APIs for the loadable encoding and charset modules. (This is step one of teasing ICU out of the core) 3) We make Parrot's string system use the loadable encoding and charset system 4) We get non-unicode encodings and charsets in 5) We make ICU a loadable module tied into the proper encodings and charset Step 1 can be done by anyone willing to poke at the configure perl code. Step 2 needs me, and I'll get that done when I'm waiting for the train today. (No, don't ask) Step 3 is the biggie here, as it touches a lot of string.c. 4's relatively easy (7-bit ASCII and binary'll be first :) and 5 may or may not be straightforward, depending on how the design goes. I'd like to get work on steps 3 and 4 going quickly -- the sooner the better -- once the API design's done. 
And yes, the API will support doing this in bytecode, though there'll be the obligatory performance penalty, so if someone later comes along and wants to reimplement the Unicode support in a parrot language, well... that'd be keen and we could toss ICU from the distribution entirely. (Though still use it if there's a system version installed) -- Dan --it's like this--- Dan Sugalski even samurai [EMAIL PROTECTED] have teddy bears and even teddy bears get drunk
Re: Unicode step by step
On Saturday 10 April 2004 15:13, Leopold Toetsch wrote: There is of course still the question: Should we really have ICU in the tree. This needs tracking updates and patching (again) to make it build and so on. In the sake of platform independence I'd say to keep it there. It's far easier if you have only the usual build dependencies and the one special thing inside the tree to quickly test on different platforms. What I want to say is that you'll find a sane build environment and a Perl on most of the machines, but even I don't have ICU installed. BTW, it doesn't compile on any platform at the moment; after a realclean, on the first make it complains about:

../data/locales/ja.txt:15: parse error. Stopped parsing with U_INVALID_FORMAT_ERROR
couldn't parse the file ja.txt. Error:U_INVALID_FORMAT_ERROR
make[1]: *** [../data/out/build/icudt26l_ja.res] Error 3

If you do a make at this point again, it skips these steps and tries to link parrot, failing on many undefined symbols, I believe from the non-existent ICU. Thanks, leo Have fun, Marcus -- :: Marcus Thiesen :: www.thiesen.org :: ICQ#108989768 :: 0x754675F2 :: I can resist anything but temptation - Oscar Wilde
Re: Unicode step by step
BTW, it doesn't compile on any platform at the moment, after a realclean on the first make it complains about:

../data/locales/ja.txt:15: parse error. Stopped parsing with U_INVALID_FORMAT_ERROR
couldn't parse the file ja.txt. Error:U_INVALID_FORMAT_ERROR
make[1]: *** [../data/out/build/icudt26l_ja.res] Error 3

Try a make realclean first--Dan checked in a fix for this, and it seems to require this to force everything to start fresh. If you do a make at this point again, it skips these steps and tries to link parrot, failing on many undefined symbols, I believe from the non-existent ICU. At this point I'd expect it to link, but maybe not run well--that failure comes when packaging up the data files, and at that point the libraries themselves should already be built and in the right place. But you are detecting some loose behavior in the Makefile, which was done in part so that ICU wouldn't rebuild unless you make clean. JEff
Re: Unicode step by step
just a confirmation... my i386 debian linux gives the same error repeatedly after make realclean, if i make again, it compiles a broken parrot which fails (too) many tests... also it seems (to me) that parrot's configured choice of compiler, linker, ... is not used in building icu? does icu have some non-ubiquitous dependencies? LF

../data/locales/ja.txt:15: parse error. Stopped parsing with U_INVALID_FORMAT_ERROR
couldn't parse the file ja.txt. Error:U_INVALID_FORMAT_ERROR
make[1]: *** [../data/out/build/icudt26l_ja.res] Error 3

Try a make realclean first--Dan checked in a fix for this, and it seems to require this to force everything to start fresh. If you do a make at this point again, it skips these steps and tries to link parrot, failing on many undefined symbols, I believe from the non-existent ICU. At this point I'd expect it to link, but maybe not run well--that failure comes when packaging up the data files, and at that point the libraries themselves should already be built and in the right place. But you are detecting some loose behavior in the Makefile, which was done in part so that ICU wouldn't rebuild unless you make clean.
Re: Unicode step by step
On Tuesday 13 April 2004 13:28, luka frelih wrote: just a confirmation... my i386 debian linux gives the same error repeatedly after make realclean, if i make again, it compiles a broken parrot which fails (too) many tests... also it seems (to me) that parrot's configured choice of compiler, linker, ... is not used in building icu? does icu have some non-ubiquitous dependencies? As I said yesterday, it worked on a machine of mine which I hadn't touched for quite some while. On my notebook, where I do daily builds, I ran into the same problem, even after having made a realclean. So I did a make clean in the icu subdir directly, deleted all files which are listed in .cvsignore, and ran the realclean configure build test all over, and now it works. Seems as if something doesn't get cleaned up in icu with a parrot realclean. Have fun, Marcus -- :: Marcus Thiesen :: www.thiesen.org :: ICQ#108989768 :: 0x754675F2 :: Do something every day that you don't want to do; this is the golden rule for acquiring the habit of doing your duty without pain - Mark Twain
Re: Unicode step by step
Marcus Thiesen wrote: Seems as if something doesn't get cleaned up in icu with a parrot realclean. Yep. I've removed cleaning icu from clean/realclean[1].

$ make help | grep clean
...
icu.clean:
...

And there is always make cvsclean. leo [1] If anyone puts that in again he might also send a lot faster PC to me (and possibly other developers ;)
Re: Unicode step by step
At 6:22 PM +0200 4/13/04, Leopold Toetsch wrote: Marcus Thiesen wrote: . Seems as if something doesn't get cleaned up in icu with a parrot realclean. Yep. I've removed cleaning icu from clean/realclean[1]. I think we need to put that back for a bit, but with this: [1] If anyone puts that in again he might also send a lot faster PC to me (and possibly other developers ;) We're also likely going to be well-off if we get configure to detect a system ICU install and use that instead. It shouldn't be that tough, but I've not had a chance to poke around in the icu part of our config system to find out what we need to do. -- Dan --it's like this--- Dan Sugalski even samurai [EMAIL PROTECTED] have teddy bears and even teddy bears get drunk
Re: Unicode step by step
Dan Sugalski [EMAIL PROTECTED] wrote: At 6:22 PM +0200 4/13/04, Leopold Toetsch wrote: Marcus Thiesen wrote: Seems as if something doesn't get cleaned up in icu with a parrot realclean. Yep. I've removed cleaning icu from clean/realclean[1]. I think we need to put that back for a bit, I did list two alternatives. The normal way of changes doesn't include changes to ICU source (and honestly shouldn't). Currently building is still a bit in flux, which does mandate a make icu.clean. And there is of course already a new ICU version on *their* website, but we still try to get/keep 2.6 running. I'm still not sure that this lib should be part of *our* tree ... ... but with this: [1] If anyone puts that in again he might also send a lot faster PC to me (and possibly other developers ;) We're also likely going to be well-off if we get configure to detect a system ICU install and use that instead. There are several issues: First one is MANIFEST and CVS and patches. Config steps should be simple. But - of course - I'd appreciate this alternative as already laid out. leo
Re: [PATCH] Re: Unicode step by step
Jeff Clites [EMAIL PROTECTED] wrote: Here's a patch to src/pf_items.c, and a ppc t/native_pbc/number_3.pbc. Works. If it's working correctly, the attached strings-and-byte-order.* should both do the same thing--output the Angstrom symbol. If it's wrong, then the pbc version should output junk on a little-endian system. (If your terminal emulator isn't prepared to handle UTF-8, then pipe the output through 'less', and you should see something like E284AB.)

$ parrot string_1.pbc
Å
$ parrot string_1.pbc | od -tx1
000 e2 84 ab 0a
004

JEff Thanks - I'll apply it RSN. leo
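[Editorial aside: the expected output of that test is easy to verify independently. The Angstrom sign is U+212B, and its UTF-8 encoding is the byte sequence E2 84 AB -- exactly the bytes od shows. A quick check, sketched in Python:]

```python
# The Angstrom sign, U+212B, encodes to three UTF-8 bytes: E2 84 AB.
angstrom = "\u212b"
utf8 = angstrom.encode("utf-8")
print(utf8.hex().upper())   # prints "E284AB"
```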
Re: Unicode step by step
On Apr 10, 2004, at 6:13 AM, Leopold Toetsch wrote: 2) String PBC layout. The internal string type has changed. This currently breaks native_pbc tests (that have strings) as well as some parrot xx.pbc tests related to strings. These are working for me (which tests are failing for you?)--I did update the PF_* API to match the changes to string internals. Of course, since the internals changed the pbc layout changed also, so the native_pbc test files need to be regenerated on the various platforms--but the ppc one I submitted (see other post, or original patch submission) should work. But if that one fails for you, it's probably b/c of byte order, and I need to look and find where we do the endianness correction for integers in pbc files, and hook in to do something similar for certain string cases. If someone can send me a number_X.pbc file generated on an i386 platform, that will help me test. But, it's correct that there's no backward-compatibility code in place, to allow reading old pbc files. Do we want to have that sort of thing at this stage? (Certainly, I'd think that after 1.0 we'd want backward compatibility with any format changes, but do we need it at this stage?) But let me know which parrot xx.pbc tests are failing for you. The layout seems to depend somehow on the supported Unicode levels (or not). So before fixing the PBC issues, I'd just have a statement: parrot_string_t looks such and such or of course as is now. Could you rephrase? I'm not understanding what you are saying. The only real change in the pbc format (if I'm recalling correctly--I'll have to go back and look) is that rather than serializing the encoding/chartype/language triple, we are writing out the s->representation (still followed by s->bufused and then the contents of the buffer). The only other wrinkle is that for cases where s->representation is 2 or 4, we need to endianness correct when we use the bytecode.
This is probably a separate discussion, but we _could_ decide instead to represent strings in pbc files always in UTF-8. Advantage: Simpler, no endianness correction needed, probably durable to further changes in string internals, could isolate s->representation awareness to string.c and string_primitives.c. Disadvantages: De-serializing a string from a pbc file will always involve a copy, and could result in larger files in some cases. I could argue it either way--one's cleaner, the other is probably faster. There is of course still the question: Should we really have ICU in the tree. This needs tracking updates and patching (again) to make it build and so on. One consideration is that I may need to patch ICU in a few places--there's at least one API which they only expose in C++, so I need to wrap it in C and it's cleaner to do that as a patch to ICU rather than having C++ code in the core of parrot. Other than that, I think it boils down to convenience, and (possibly) consistency in being able to say that parrot version foo corresponds to ICU version bar (but maybe we don't need to be able to say that). JEff
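[Editorial aside: the endianness correction described for 16-bit code units amounts to swapping every byte pair in the buffer when the bytecode was written on a machine of the other byte order. A minimal sketch in Python for illustration (the real fix lives in C in pf_items.c; `swap16` is a made-up helper name):]

```python
import array

def swap16(buf: bytes) -> bytes:
    """Swap the byte order of a buffer of 16-bit code units."""
    a = array.array("H", buf)   # view the buffer as 16-bit units
    a.byteswap()                # reverse the two bytes of each unit
    return a.tobytes()

le = b"\x2b\x21"                # U+212B stored little-endian
print(swap16(le).hex())         # prints "212b"
```

The same idea extends to 32-bit code units (representation 4) by swapping four-byte groups instead of pairs.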
[PATCH] Re: Unicode step by step
On Apr 10, 2004, at 1:12 PM, Jeff Clites wrote: On Apr 10, 2004, at 6:13 AM, Leopold Toetsch wrote: 2) String PBC layout. The internal string type has changed. This currently breaks native_pbc tests (that have strings) as well as some parrot xx.pbc tests related to strings. These are working for me (which tests are failing for you?)--I did update the PF_* API to match the changes to string internals. Of course, since the internals changed the pbc layout changed also, so the native_pbc test files need to be regenerated on the various platforms--but the ppc one I submitted (see other post, or original patch submission) should work. But if that one fails for you, it's probably b/c of byte order, and I need to look and find where we do the endianness correction for integers in pbc files, and hook in to do something similar for certain string cases. Here's a patch to src/pf_items.c, and a ppc t/native_pbc/number_3.pbc. If it's working correctly, the attached strings-and-byte-order.* should both do the same thing--output the Angstrom symbol. If it's wrong, then the pbc version should output junk on a little-endian system. (If your terminal emulator isn't prepared to handle UTF-8, then pipe the output through 'less', and you should see something like E284AB.) (PS--I had to give the pbc file a fake extension, to keep the develooper mail server from rejecting it.) JEff [Attachments: pf_items_c.patch, number_3.pbc, strings-and-byte-order.pasm, strings-and-byte-order.pbc.file]
Re: Unicode support in Emacs
>>>>> "Karl" == Karl Brodowsky [EMAIL PROTECTED] writes: I get the impression that Unicode-support has kind of gone on top of this stuff and I must admit that the way I am currently using Unicode is to edit the stuff with \ucafe\ubabe-kind of replacements and run perlscripts to convert for example my private html-format into WWW-html. Um. That sounds like a lot of work... XEmacs handles Unicode and UTF-8 quite well, and has for the last couple of years[1]. It may have problems that I don't know of if you dig sufficiently far down, and it may not cooperate flawlessly with all possible major and minor modes, but it's at least good enough for me to edit XML documents in UTF-8 and to read UTF-8-encoded News postings and mail without problems. The most difficult bit has been to find a Unicode font that isn't butt-ugly. [1] In the 21.4.x series you need to install a lisp module (which the XEmacs package system will do for you if you ask it; it's seriously inspired by CPAN) and add three lines to your .emacs. In 21.5.x it should all Just Work. -- Calle Dybedahl [EMAIL PROTECTED] http://www.livejournal.com/users/cdybedahl/ Last week was a nightmare, never to be repeated - until this week -- Tom, a.s.r
Re: Unicode in Emacs (was: Semantics of vector operations)
On Feb 03, David Wheeler wrote: On Feb 3, 2004, at 7:13 AM, Kurt Starsinic wrote: No joke. You'll need to have the mule-ucs module installed. A quick Google search turns up plenty of sources. Oh, I have Emacs 21.3.50. Mule is gone. I'm afraid you're on your own, then. I'm using 21.3.1. If you work it out, please post. - Kurt
Re: Unicode under Windows (was RE: Semantics of vector operations)
Austin Hastings wrote: From: Rod Adams [mailto:[EMAIL PROTECTED] Question in all this: What does one do when they have to _debug_ some code that was written with these lovely Unicode ops, all while stuck in an ASCII world? That's why I suggested a standard script for Unicode2Ascii be shipped with the distro. Good idea, which would also beg an ASCII2Unicode script to reverse the process. Also, isn't it a pain to type all these characters when they are not on your keyboard? As a predominately Win2k/XP user in the US, I see all these glyphs just fine, but having to remember Alt+0171 for a « is going to get old fast... I'd much sooner go ahead and write E<raquo> and be done with it. Thoughts? This has been discussed a bunch of times, but for Windows users the very best thing in the US is to change your Start > Settings > Control Panel > Keyboard > Input Locales so that you have the option of switching over to a United States-International IME. Once you've got that available (I used the Left-Alt+Shift hotkey) you can make a map of the keys. The only significant drawback is the behavior of the quote character, since it is used to encode accent marks. It takes getting used to the quote+space behavior, or defining a macro key (hint, hint). (Links Snipped) Thanks for the pointers. I've now set up Win2k so I can easily switch between US and United States International. Works nicely. Now I have to go beat up the Thunderbird guys for trapping the keyboard directly and not allowing me to type the chars here. Thanks Again -- Rod
RE: Unicode under Windows (was RE: Semantics of vector operations)
-Original Message- From: Austin Hastings [mailto:[EMAIL PROTECTED] From: Rod Adams [mailto:[EMAIL PROTECTED] Question in all this: What does one do when they have to _debug_ some code that was written with these lovely Unicode ops, all while stuck in an ASCII world? That's why I suggested a standard script for Unicode2Ascii be shipped with the distro. Also, isn't it a pain to type all these characters when they are not on your keyboard? As a predominately Win2k/XP user in the US, I see all these glyphs just fine, but having to remember Alt+0171 for a « is going to get old fast... I'd much sooner go ahead and write E<raquo> and be done with it. Thoughts? This has been discussed a bunch of times, but for Windows users the very best thing in the US is to change your Start > Settings > Control Panel > Keyboard > Input Locales so that you have the option of switching over to a United States-International IME. Once you've got that available (I used the Left-Alt+Shift hotkey) you can make a map of the keys. The only significant drawback is the behavior of the quote character, since it is used to encode accent marks. It takes getting used to the quote+space behavior, or defining a macro key (hint, hint). Sorry for the self-reply, but here's some links for you:

These guys sell an overlay, and include a picture of the overlay, for US-Int keyboarding: http://www.datacal.com/dce/catalog/us-international-layout.htm
Some extra information with a Francophonic spin to it: http://www.lehman.cuny.edu/depts/langlit/labs/keyboard.htm
A more complete keyboard diagram: http://www.worldnames.net/ML_input/InternationalKeyboard.cfm
From the horse's mouth there is an interesting applet: http://www.microsoft.com/globaldev/reference/keyboards.aspx
Finally, you could define your OWN keyboard layout using this tool (requires .NET install): http://www.microsoft.com/globaldev/tools/msklc.mspx

=Austin
Re: Unicode, internationalization, C++, and ICU
Maybe we can use someone else's solution... http://lists.ximian.com/archives/public/mono-list/2003-November/ 016731.html On 16 Jan 2004, at 00:33, Jonathan Worthington wrote: - Original Message - From: Dan Sugalski [EMAIL PROTECTED] To: [EMAIL PROTECTED] Sent: Thursday, January 15, 2004 8:09 PM Subject: Unicode, internationalization, C++, and ICU Now, assuming there's still anyone left reading this message... We've been threatening to build ICU into parrot, and it's time for that to start happening. Unfortunately there's a problem--it doesn't work right now. So, what we need is some brave soul to track ICU development and keep us reasonably up to date. What I'd really like is: 1) ICU building and working 2) ICU not needing any C++ I've done some testing, and I hate to be the bearer of bad news but I believe we have something of a problem. :-( The configure script turns out to be a shell script which, unless I'm mistaken, means we're currently unable to build ICU anywhere we don't have bash or similar. Win32 for starters, which is where I'm testing. A possible solution might be to re-write the configure script in Perl - though we'd have to keep it maintained as we do ICU updates. Another one, for Win32 at least, is that we *might* be able to use UNIX Services For Win32 and run configure under that, generate a Win32 makefile and just copy it in place with the configure script. Less portable to other places with the same problem, though, and again we have to maintain it as we update ICU. There is also a problem with the configure stage on Win32, but that's an aside until the above issue is sorted out. I also gave it a spin in cygwin, where the configure script for ICU runs OK, but there's no C++ compiler so it doesn't get built. Thoughts? Jonathan
Re: Unicode, internationalization, C++, and ICU
At 10:40 AM +0100 1/16/04, Michael Scott wrote: Maybe we can use someone else's solution... http://lists.ximian.com/archives/public/mono-list/2003-November/016731.html Could be handy. We really ought to detect a system-installed ICU and use that rather than our local copy at configure time, if it's of an appropriate version. That'd at least avoid having two copies, and potentially get us some system-wide runtime memory savings. - Original Message - From: Dan Sugalski [EMAIL PROTECTED] To: [EMAIL PROTECTED] Sent: Thursday, January 15, 2004 8:09 PM Subject: Unicode, internationalization, C++, and ICU Now, assuming there's still anyone left reading this message... We've been threatening to build ICU into parrot, and it's time for that to start happening. Unfortunately there's a problem--it doesn't work right now. So, what we need is some brave soul to track ICU development and keep us reasonably up to date. What I'd really like is: 1) ICU building and working 2) ICU not needing any C++ I've done some testing, and I hate to be the bearer of bad news but I believe we have something of a problem. :-( The configure script turns out to be a shell script which, unless I'm mistaken, means we're currently unable to build ICU anywhere we don't have bash or similar. Win32 for starters, which is where I'm testing. A possible solution might be to re-write the configure script in Perl - though we'd have to keep it maintained as we do ICU updates. Another one, for Win32 at least, is that we *might* be able to use UNIX Services For Win32 and run configure under that, generate a Win32 makefile and just copy it in place with the configure script. Less portable to other places with the same problem, though, and again we have to maintain it as we update ICU. There is also a problem with the configure stage on Win32, but that's an aside until the above issue is sorted out. 
I also gave it a spin in cygwin, where the configure script for ICU runs OK, but there's no C++ compiler so it doesn't get built. Thoughts? Jonathan -- Dan --it's like this--- Dan Sugalski even samurai [EMAIL PROTECTED] have teddy bears and even teddy bears get drunk
Re: Unicode, internationalization, C++, and ICU
On Jan 15, 2004, at 3:33 PM, Jonathan Worthington wrote: - Original Message - From: Dan Sugalski [EMAIL PROTECTED] To: [EMAIL PROTECTED] Sent: Thursday, January 15, 2004 8:09 PM Subject: Unicode, internationalization, C++, and ICU Now, assuming there's still anyone left reading this message... We've been threatening to build ICU into parrot, and it's time for that to start happening. Unfortunately there's a problem--it doesn't work right now. So, what we need is some brave soul to track ICU development and keep us reasonably up to date. What I'd really like is: 1) ICU building and working 2) ICU not needing any C++ I've done some testing, and I hate to be the bearer of bad news but I believe we have something of a problem. :-( The configure script turns out to be a shell script which, unless I'm mistaken, means we're currently unable to build ICU anywhere we don't have bash or similar. Win32 for starters, which is where I'm testing. This page gives instructions for building on Windows--it doesn't seem to require installing bash or anything: http://oss.software.ibm.com/cvs/icu/~checkout~/icu/readme.html#HowToBuildWindows I assume that on Windows you don't need to run the configure script. JEff
Re: Unicode, internationalization, C++, and ICU
snip This page gives instructions for building on Windows--it doesn't seem to require installing bash or anything: http://oss.software.ibm.com/cvs/icu/~checkout~/icu/readme.html#HowToBuildWindows I assume that on Windows you don't need to run the configure script. Thanks for that, I'll work on and test a patch for the Configure script to do this on Win32 later. It won't help with any compiler other than MSVC++, but it certainly helps. Thanks, Jonathan
Re: Unicode, internationalization, C++, and ICU
Well I did originally have this in mind, but the more I looked into it the more I thought it needed someone with unicode experience. It seems to me that the unicode world is full of "ah, but in North Icelandic Yiddish aleph is considered to be an infinitely composite character" and other such arcane exceptions that make the inexperienced the natural victims of their own rational assumptions. Also, given the icu-not-building problem (http://www.mail-archive.com/[EMAIL PROTECTED]/msg17477.html) maybe what we need is an icu person per platform. This might have the benefit of making the task seem less onerous. I did manage to get it building on OS X (still does, I just checked). I wonder on what systems it is actually failing? I'll include this wiki page again because it contains a few links that unicode-savvy lurkers might find useful. http://www.vendian.org/parrot/wiki/bin/view.cgi/Main/ParrotDistributionUnicodeSupport Mike On 15 Jan 2004, at 21:09, Dan Sugalski wrote: Now, assuming there's still anyone left reading this message... We've been threatening to build ICU into parrot, and it's time for that to start happening. Unfortunately there's a problem--it doesn't work right now. So, what we need is some brave soul to track ICU development and keep us reasonably up to date. What I'd really like is: 1) ICU building and working 2) ICU not needing any C++ I'd also like a pony, too, so I can live if we don't get #2, at least for a bit (as it means that we now require a C++ compiler to build parrot). Anyone care to volunteer? -- Dan --it's like this--- Dan Sugalski even samurai [EMAIL PROTECTED] have teddy bears and even teddy bears get drunk
Re: Unicode, internationalization, C++, and ICU
- Original Message - From: Dan Sugalski [EMAIL PROTECTED] To: [EMAIL PROTECTED] Sent: Thursday, January 15, 2004 8:09 PM Subject: Unicode, internationalization, C++, and ICU Now, assuming there's still anyone left reading this message... We've been threatening to build ICU into parrot, and it's time for that to start happening. Unfortunately there's a problem--it doesn't work right now. So, what we need is some brave soul to track ICU development and keep us reasonably up to date. What I'd really like is: 1) ICU building and working 2) ICU not needing any C++ I've done some testing, and I hate to be the bearer of bad news but I believe we have something of a problem. :-( The configure script turns out to be a shell script which, unless I'm mistaken, means we're currently unable to build ICU anywhere we don't have bash or similar. Win32 for starters, which is where I'm testing. A possible solution might be to re-write the configure script in Perl - though we'd have to keep it maintained as we do ICU updates. Another one, for Win32 at least, is that we *might* be able to use UNIX Services For Win32 and run configure under that, generate a Win32 makefile and just copy it in place with the configure script. Less portable to other places with the same problem, though, and again we have to maintain it as we update ICU. There is also a problem with the configure stage on Win32, but that's an aside until the above issue is sorted out. I also gave it a spin in cygwin, where the configure script for ICU runs OK, but there's no C++ compiler so it doesn't get built. Thoughts? Jonathan
Re: Unicode operators
At 1:27 PM -0800 11/6/02, Brad Hughes wrote: Flaviu Turean wrote: [...] 5. if you want to wait for the computing platforms before programming in p6, then there is quite a wait ahead. how about platforms which will never catch up? VMS, anyone? Not to start an OS war thread or anything, but why do people still have this mistaken impression of VMS? We have compilers and hard drives and networking and everything. We even have color monitors. Sure, we lack a decent c++ compiler, but we consider that a feature. :-) Lacking a decent C++ compiler isn't necessarily a strike against VMS--to be a strike against, there'd actually have to *be* a decent C++ compiler... -- Dan --it's like this--- Dan Sugalski even samurai [EMAIL PROTECTED] have teddy bears and even teddy bears get drunk
vote no - Re: Unicode operators [Was: Re: UTF-8 and Unicode FAQ, demos]
The first message had many of the following characters viewable in my telnet window, but the repost introduced a 0xC2 prefix to the 0xA7 character. I have this feeling that many people would vote against posting all these funny characters, as it does make reading the perl6 mailing lists difficult in some contexts. Ever since introducing these UTF-8 >127 characters into this mailing list, I can never be sure of what the posting author intended to send. I'm all for supporting UTF-8 characters in strings, and perhaps even in variable names, but do we really have to have perl6 programs with core operators in UTF-8? I'd like to see all the perl6 code that had UTF-8 operators start with use non_portable_utf8_operators. As it stands now, I'm going to have to find new tools for my linux platform that has been performing fine since 1995 (perl5.9 still supports libc5!), and I don't yet know how I am going to be able to telnet in from win98, and I'll bet that the dos kermit that I use when I dial up won't support UTF-8 characters either. David ps. I just read how many people will need to upgrade their operating systems if they want to upgrade to MS Word11. Do we want to require operating system and/or many support tools to be upgraded before we can share perl6 scripts via email? On Tue, 5 Nov 2002 at 09:56 -0800, Michael Lazzaro [EMAIL PROTECTED]:

Code  Symbol  Comment
167   §       Could be used
169   ©       Could be used
171   «       May well be used
172   ¬       Not?
174   ®       Could be used
176   °       Could be used
177   ±       Introduces an interesting level of uncertainty? Useable
181   µ       Could be used
182   ¶       Could be used
186   º       Could be used (but I dislike it as it is alphabetic)
187   »       May well be used
191   ¿       Could be used
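[Editorial aside: the stray 0xC2 prefix David mentions is not random corruption, it is UTF-8 at work. Every Latin-1 character in the range 0xA0-0xBF (section sign, guillemets, ...) becomes a two-byte sequence starting with 0xC2 when re-encoded as UTF-8, so the section sign 0xA7 turns into C2 A7. A quick demonstration, sketched in Python:]

```python
# The section sign is 0xA7 in Latin-1; re-encoded as UTF-8 it gains a
# 0xC2 prefix byte, which is exactly what the repost showed.
section = b"\xa7".decode("latin-1")
print(section.encode("utf-8").hex())   # prints "c2a7"
```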
Re: Unicode operators [Was: Re: UTF-8 and Unicode FAQ, demos]
This UTF discussion has got silly. I am sitting at a computer that is operating in native Latin-1 and is quite happy - there is no likelihood that UTF* is ever likely to reach it. The guillemets are coming through fine, but most of the other hieroglyphs leave a lot to be desired. Let's consider the coding comparisons. Chars in the range 128-159 are not defined in Latin-1 (issue 1) and are used differently by Windows to Latin-1 (later issues), so should be avoided. Chars in the range 160-191 (which include the guillemets) are coming through fine if encoded by the sender as UTF8. Anything in the range 192-255 is encoded differently and thus should be avoided. Therefore the only additional characters that could be used, that will work under UTF8 and Latin-1 and Windows, are:

Code  Symbol  Comment
160           Non-breaking space (map to normal whitespace)
161   ¡       Could be used
162   ¢       Could be used
163   £       Could be used
164   ¤       Could be used
165   ¥       Could be used
166   ¦       Could be used
167   §       Could be used
168   ¨       Could be used though risks confusion with
169   ©       Could be used
170   ª       Could be used (but I dislike it as it is alphabetic)
171   «       May well be used
172   ¬       Not?
173           Nonbreaking - treat as the same
174   ®       Could be used
175   ¯       May cause confusion with _ and -
176   °       Could be used
177   ±       Introduces an interesting level of uncertainty? Useable
178   ²       To the power of 2 (squaring?) Otherwise best avoided
179   ³       Cubing? Otherwise best avoided
180   ´       Too confusing with ' and `
181   µ       Could be used
182   ¶       Could be used
183   ·       Dot product? though likely to be confused with .
184   ¸       treat as ,
185   ¹       To the power 1? Probably best avoided
186   º       Could be used (but I dislike it as it is alphabetic)
187   »       May well be used
188   ¼       Could be used
189   ½       Could be used
190   ¾       Could be used
191   ¿       Could be used

Richard -- Personal [EMAIL PROTECTED] http://www.waveney.org Telecoms [EMAIL PROTECTED] http://www.WaveneyConsulting.com Web services [EMAIL PROTECTED] http://www.wavwebs.com Independent Telecomms Specialist, ATM expert, Web Analyst Services
Re: Unicode operators [Was: Re: UTF-8 and Unicode FAQ, demos]
Thanks, I've been hoping for someone to post that list. Taking it one step further, we can assume that the only chars that can be used are those which:
-- don't have an obvious meaning that needs to be reserved
-- appear decently on all platforms
-- are distinct and recognizable in the tiny font sizes used when programming

Comparing your list with mine, with some subjective editing based on my small courier font, that chops the list of usable operators down to only a handful:

Code  Symbol  Comment
167   §       Could be used
169   ©       Could be used
171   «       May well be used
172   ¬       Not?
174   ®       Could be used
176   °       Could be used
177   ±       Introduces an interesting level of uncertainty? Useable
181   µ       Could be used
182   ¶       Could be used
186   º       Could be used (but I dislike it as it is alphabetic)
187   »       May well be used
191   ¿       Could be used

That's all. A shame, because some of the others have very interesting possibilities: • ≠ ø † ∑ ∂ ƒ ∆ ≤ ≥ ∫ ≈ Ω ‡ ± ˇ ∏ Æ But if Windows can't easily do them, that's a pretty big problem. Thanks for the list. MikeL
Re: Unicode operators [Was: Re: UTF-8 and Unicode FAQ, demos]
I'm all for one or two Unicode operators if they're chosen properly (and I trust Larry to do that since he's done a stellar job so far), but what's the mechanism to generate Unicode operators if you don't have access to a unicode-aware editor/terminal/font/etc.? Is the only recourse to use the named versions? Or will there be some sort of digraph/trigraph/whatever sequence that always gives us the operator we need? Something like \x[263a] but in regular code and not just quote-ish contexts:

$campers = $a \x[263a] $b    # make $a and $b happy

-Scott -- Jonathan Scott Duff [EMAIL PROTECTED]
Re: Unicode operators [Was: Re: UTF-8 and Unicode FAQ, demos]
Dan Kogai wrote: We already have source filters in perl5 and I'm pretty much sure someone will just invent yet another 'use operators = ascii;' kind of stuff in perl6.

I think it's backwards to have operators be funny characters by default while requiring explicit declaration to use well-known ASCII characters. Doing it t'other way round would mean that you can always write fully portable code fragments in pure ASCII, something that'd be helpful on mailing lists and the like. There could be an alias syntax for people in an environment where they'd prefer to have a non-ASCII character in place of a conglomerate of ASCII symbols, maybe:

treat '»...«' as '[...]';

That has the documentational advantage that any non-ASCII character used in code must be declared earlier in that file. And even if the non-ASCII character gets warped in the post and displays oddly for you, you can still see what the author intended it to do. This has the risk that Damian described of everybody defining their own operators, but I think that's unlikely. There's likely to be a convention used by many people, at least those who operate in a given character set. This way also permits those who live in a Latin-2 (or whatever) world to have their own convention using characters that make sense to them. Smylers
Re: Unicode operators [Was: Re: UTF-8 and Unicode FAQ, demos]
Richard Proctor wrote: I am sitting at a computer that is operating in native Latin-1 and is quite happy - there is no likelihood that UTF* will ever reach it. ... Therefore the only additional characters that could be used, that will work under UTF-8 and Latin-1 and Windows ...

What about people who don't use Latin-1, perhaps because their native language uses Latin-2 or some other character set mutually exclusive with Latin-1? I don't have a Latin-2 ('Central and East European languages') typeface handy, but its manpage includes:

253   171   AB   LATIN CAPITAL LETTER T WITH CARON
273   187   BB   LATIN SMALL LETTER T WITH CARON

Caron is sadly missing from my dictionary so I'm not sure what those would look like, but I suspect they wouldn't be great symbols for vector operators.

171 « May well be used

Also I wonder how similar to doubled less-than or greater-than signs guillemets would look. In this font they're fine, but I'm concerned about my ability to make them sufficiently distinguishable on a whiteboard, and whether publishers will cope with them (compare a recent discussion on 'use Perl' regarding curly quotes and fi ligatures appearing in code samples). Smylers
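The collision Smylers describes is literal: the very same byte names a guillemet in Latin-1 and a T-with-caron in Latin-2. A Python sketch using the stdlib codecs (my addition):

```python
# Byte 0xAB (decimal 171) is LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
# in ISO 8859-1 but LATIN CAPITAL LETTER T WITH CARON in ISO 8859-2,
# exactly the cross-charset ambiguity discussed above.
b = b'\xab'
print(b.decode('iso8859-1'))   # «
print(b.decode('iso8859-2'))   # Ť
```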
Re: Unicode operators [Was: Re: UTF-8 and Unicode FAQ, demos]
On Tue 05 Nov, Smylers wrote: Richard Proctor wrote: I am sitting at a computer that is operating in native Latin-1 and is quite happy - there is no likelihood that UTF* will ever reach it. ... Therefore the only additional characters that could be used, that will work under UTF-8 and Latin-1 and Windows ... What about people who don't use Latin-1, perhaps because their native language uses Latin-2 or some other character set mutually exclusive with Latin-1?

Once you go beyond Latin-1 there is nothing common anyway. The guillemets become T and t with inverted hats under Latin-2, oe and G with an inverted hat under Latin-3, oe and G with a squiggle under it under Latin-4, no meaning and a stylised K for Latin-5 (can't find Latin-6), guillemets under Latin-7, nothing under Latin-8.

Richard
--
Personal     [EMAIL PROTECTED]  http://www.waveney.org
Telecoms     [EMAIL PROTECTED]  http://www.WaveneyConsulting.com
Web services [EMAIL PROTECTED]  http://www.wavwebs.com
Independent Telecomms Specialist, ATM expert, Web Analyst Services
Re: Unicode operators [Was: Re: UTF-8 and Unicode FAQ, demos]
As one of the instigators of this thread, I submit that we've probably argued about the Unicode stuff enough. The basic issues are now known, and it's known that there's no general agreement on any of this stuff, nor will there ever be. To wit:

-- Extended glyphs might be extremely useful in extending the operator table in non-ambiguous ways, especially for advanced things like «op».
-- Many people loathe the idea, and predict newcomers will too.
-- Many mailers and older platforms tend to react badly, for both viewing and inputting.
-- If extended characters are used at all, the decision needs to be made whether they shall be least-common-denominator Latin-1, UTF-8, or full Unicode, and whether there are backup spellings so that everyone can play.

It's up to Larry, and he knows where we're all coming from. Unless anyone has any _new_ observations, I propose we pause the debate until a decision is reached? MikeL
Re: Unicode operators [Was: Re: UTF-8 and Unicode FAQ, demos]
Scott Duff wrote: I'm all for one or two unicode operators if they're chosen properly (and I trust Larry to do that since he's done a stellar job so far), but what's the mechanism to generate unicode operators if you don't have access to a unicode-aware editor/terminal/font/etc.? Is the only recourse to use the named versions? Or will there be some sort of digraph/trigraph/whatever sequence that always gives us the operator we need? Something like \x[263a] but in regular code and not just quote-ish contexts: $campers = $a \x[263a] $b # make $a and $b happy

That would probably be:

$campers = $a \c[263a] $b    # make $a and $b happy

if it were allowed (which I suspect it mightn't be, since it looks rather like an index on a reference to the value returned by a call to the subroutine C<c>). Incidentally, this is why I previously suggested that we might allow POD escapes in code as well. Thus:

$campers = $a E<263a> $b    # make $a and $b happy

Damian
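For comparison, later languages did settle on exactly this kind of in-source escape. A Python sketch of the idea (Python syntax, not any proposed Perl 6 syntax):

```python
# Numeric and named escapes spell a character without needing a
# Unicode-capable editor -- the same job \x[263a] or E<263a> would do.
by_number = "\u263a"
by_name = "\N{WHITE SMILING FACE}"
assert by_number == by_name
print(by_number)
```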
Re: Unicode operators [Was: Re: UTF-8 and Unicode FAQ, demos]
Michael Lazzaro proposed: It's up to Larry, and he knows where we're all coming from. Unless anyone has any _new_ observations, I propose we pause the debate until a decision is reached? I second the motion! Damian
RE: Unicode thoughts...
At 4:32 PM -0800 3/25/02, Brent Dax wrote: I *really* strongly suggest we include ICU in the distribution. I recently had to turn off mod_ssl in the Apache 2 distro because I couldn't get OpenSSL downloaded and configured. FWIW, ICU in the distribution is a given if we use it. Parrot will require a C compiler and link tools (maybe make, but maybe not) to build on a target platform and nothing else. If we rely on ICU we must ship with it. -- Dan --it's like this--- Dan Sugalski even samurai [EMAIL PROTECTED] have teddy bears and even teddy bears get drunk
Re: Unicode thoughts...
Someone said that ICU requires a C++ compiler. That's concerning to me, as is the issue of how we bootstrap our build process. We were planning on a platform-neutral miniparrot, and IMHO that can't include ICU (as I'm sure it's not going to be written in pure ANSI C). --Josh

At 8:45 on 03/30/2002 EST, Dan Sugalski [EMAIL PROTECTED] wrote: FWIW, ICU in the distribution is a given if we use it. Parrot will require a C compiler and link tools (maybe make, but maybe not) to build on a target platform and nothing else. If we rely on ICU we must ship with it. -- Dan
Re: Unicode thoughts...
At 10:07 AM -0500 3/30/02, Josh Wilmes wrote: Someone said that ICU requires a C++ compiler. That's concerning to me, as is the issue of how we bootstrap our build process. We were planning on a platform-neutral miniparrot, and IMHO that can't include ICU (as I'm sure it's not going to be written in pure ANSI C)

If the C++ bits are redoable as C, I'm OK with it. I've not taken a good look at it to know how much it depends on C++. If it's mostly // comments and such we can work around the issues easily enough. If it's objects, well, I suppose it depends on how much it relies on them.

Dan
--it's like this---
Dan Sugalski even samurai
[EMAIL PROTECTED] have teddy bears and even teddy bears get drunk
Re: Unicode thoughts...
Dan Sugalski wrote: At 10:07 AM -0500 3/30/02, Josh Wilmes wrote: Someone said that ICU requires a C++ compiler. That's concerning to me, as is the issue of how we bootstrap our build process. We were planning on a platform-neutral miniparrot, and IMHO that can't include ICU (as i'm sure it's not going to be written in pure ansi C) If the C++ bits are redoable as C, I'm OK with it. I've not taken a good look at it to know how much it depends on C++. If it's mostly // comments and such we can work around the issues easily enough. If its objects, well, I suppose it depends on how much it relies on them. Looking at icu/common I see more .c files than .cpp files, and what .cpp files there are look somewhat like wrappers around the C code. In addition, the .cpp files appear to do such things as create iterator wrappers, bidirectional display and normalization. Some of the files are thicker than wrappers, but I think there's enough code behind this C++ veneer that we can at least use it if not the entire library. -- Jeff [EMAIL PROTECTED]
RE: Unicode thoughts...
Jeff: # This will likely open yet another can of worms, but Unicode has been # delayed for too long, I think. It's time to add the Unicode libraries # (In our case, the ICU libraries at http://oss.software.ibm.com/icu/, # which Larry has now blessed) to Parrot. string.c already has # (admittedly # unavoidable, due to the library not being included) # assumptions such as # isdigit(). So, I have a few thoughts (that may have already been shot # down by people wiser than I in such matters) to explicate, and some # questions to ask. # # ICU should be added as a note in the README, and maybe to 'INSTALL' if # we ever create one. Let's not add it to CVS, as it's not under our # control. If we have to patch ICU to make it work correctly # with Parrot, # the patches should be submitted back to the ICU team. And I'm joining # the appropriate mailing lists to keep appraised of development. I *really* strongly suggest we include ICU in the distribution. I recently had to turn off mod_ssl in the Apache 2 distro because I couldn't get OpenSSL downloaded and configured. We also need to make sure ICU will work everywhere. And I do mean *everywhere*. Will it work on VMS? Palm OS? Crays? # Before Unicode goes into full swing, I need some idea of how # we're going # to deploy the libraries. On this note, I defer to the # Configure master, # Brent. I've already done some work with ICU, so I'm reasonably # comfortable with migrating in one Unicode bit at a time, until we're # ready for full UTF-16 compliance. # # The RE engine should (I'm speaking without having recently read the # source, so feel free to correct me) not need to be migrated, as it's # already using UTF-32 internally, which leaves just the string # internals. # These can be migrated to using ICU macros fairly easily (I've already # done some of the work locally), so I think the main focus should be on # encodings, as we'll have to eventually support the more common # wide-character encodings such as KOI-8 and BIG5. 
There are a few things that need to change, but they aren't big issues. Mostly it's just places where character sets have been presumed. However, I'm seriously thinking about a major re-architecture of the regex engine, which would probably help these sorts of issues. # I still have some questions about using UTF-16 internally for string # representation (as mentioned in # http:[EMAIL PROTECTED]/msg07856.html), # but I've resolved most of those. It's an excellent match for the ICU # library, as it uses UTF-16 internally. My only question is if we're # going to incur a performance hit every time a scalar is transferred to # the RE engine, as it uses UTF-32 internally. That can change. However, UTF-32 seems like the best match, as it would allow us to reach into a string's guts for speed. (We don't currently do that, but if I do redesign the engine, I'll probably be able to.) # Also, once we have UTF-16 running internally, I'd be interested in # seeing what memory consumption looks like vs. UTF-32, because I'd like to # see if it makes sense to add a compile-time switch between UTF-8 and # UTF-32 to let the installer decide on memory tradeoffs. ICU has an # internal macro that defines its own internal representation, and that # could conflict with our intended usage as well. # # Performance would suffer in the UTF-8 case, naturally, but the # difference in memory usage might be significant enough that we'd want to # leave the decision up to the installer. Having said that, the headache # of testing multiple versions of Perl6 might not be worth it. # # So, to wrap up, I'm soliciting thoughts on how best to start the Unicode # migration, and deal with the inevitable problems that will come up. I'm # hoping that most of the magic will be hidden in string.c, where we won't # have to worry about it, but we'll have to see. # # Now, this is admittedly being composed at 2:00 A.M, so my thoughts may # not be the most coherent, and for that I apologize. 
Most of my concern # stems from how best to add build steps to the various platforms without # ending up with a completely broken Parrot for weeks and developers # screaming about What the *HELL* is this error? Where is this library? # brane explodes. If these issues have already been beaten to death and # we've moved on to more interesting issues, of course I'll be interested # there as well. Overall you seem to be pretty on target. Of course, my brain isn't really built for character sets and stuff like that. Also note that I went to bed at one, was rudely awakened by a screaming toddler at two, didn't fall asleep again till four, and woke up at nine, so I'm probably not very coherent. I feel a little dizzy--I'm gonna take a nap. --Brent Dax [EMAIL PROTECTED] @roles=map {Parrot $_} qw(embedding regexen Configure) #define private public --Spotted in a C++ program just before a #include
RE: Unicode thoughts...
We also need to make sure ICU will work everywhere. And I do mean *everywhere*. Will it work on VMS? Palm OS? Crays?

Nope, nope, and nope. From their site -

Operating system     Compiler                   Testing frequency
Windows 98/NT/2000   Microsoft Visual C++ 6.0   Reference platform
Red Hat Linux 6.1    gcc 2.95.2                 Reference platform
AIX 4.3.3            xlC 3.6.4                  Reference platform
Solaris 2.6          Workshop Pro CC 4.2        Reference platform
HP/UX 11.01          aCC A.12.10                Reference platform
AIX 5.1.0L           Visual Age C++ 5.0         Regularly tested
Solaris 2.7          Workshop Pro CC 6.0        Regularly tested
Solaris 2.6          gcc 2.91.66                Regularly tested
FreeBSD 4.4          gcc 2.95.3                 Regularly tested
HP/UX 11.01          CC A.03.10                 Regularly tested
OS/390 (zSeries)     CC r10                     Regularly tested
AS/400 (iSeries)     V5R1 iCC                   Rarely tested
NetBSD, OpenBSD                                 Rarely tested
SGI/IRIX                                        Rarely tested
PTX                                             Rarely tested
OS/2                 Visual Age                 Rarely tested
Macintosh                                       Needs help to port

-(MBrod)-
Re: Unicode thoughts...
This is rather concerning to me. As I understand it, one of the goals for parrot was to be able to have a usable subset of it which is totally platform-neutral (pure ANSI C). If we start to depend too much on another library which may not share that goal, we could have trouble with the parrot build process (which was supposed to be shipped as parrot bytecode). --Josh

At 17:02 on 03/25/2002 PST, Charles Bunders [EMAIL PROTECTED] wrote: We also need to make sure ICU will work everywhere. And I do mean *everywhere*. Will it work on VMS? Palm OS? Crays? Nope, nope, and nope.
RE: Unicode thoughts...
I think it will be relatively easy to deal with different compilers and different operating systems. However, ICU does contain some C++ code. That will make life much harder, since current Parrot assumes only ANSI C (even a subset of it). Hong

This is rather concerning to me. As I understand it, one of the goals for parrot was to be able to have a usable subset of it which is totally platform-neutral (pure ANSI C). If we start to depend too much on another library which may not share that goal, we could have trouble with the parrot build process (which was supposed to be shipped as parrot bytecode)
Re: Unicode thoughts...
Hong Zhang wrote: I think it will be relatively easy to deal with different compilers and operating systems. However, ICU does contain some C++ code. That will make life much harder, since current Parrot assumes only ANSI C (even a subset of it). Hong This is rather concerning to me. As I understand it, one of the goals for parrot was to be able to have a usable subset of it which is totally platform-neutral (pure ANSI C). If we start to depend too much on another library which may not share that goal, we could have trouble with the parrot build process (which was supposed to be shipped as parrot bytecode)

I guess it's obvious that I hadn't looked at the target platforms for ICU as closely as I probably should have. C vs. C++ doesn't concern me, as it can always be rewritten, but lack of platforms like OS X does. Given that, I think the interim solution is a set of the basic Unicode utilities we'll need, such as Unicode_isdigit(). This can be a simple wrapper around isdigit() for the moment, until I sort out which files we need from the Unicode database, and what support functions/data structures will be required. Given that we're dedicated to either UTF-16 or UTF-32 for internal string representation (undecided as of yet, and not affected by this), we can get away with creating a simple unicode.{c,h} suite of functions that looks like:

Parrot_Int Parrot_isDigit(char* glyph);

We can get away with the simplicity here because the character array should already be a valid UTF-{16,32} string, and responsibility for making sure there's a valid glyph at that offset can be safely offloaded to the caller, if not higher up the calling chain. Also, it should be in a separate file because, assuming the final internal representation matches that of the RE engine, the engine can use these utilities as well. 
Now, admittedly this is only slightly better thought out than the original proposal, but I think it has a much better chance of being implemented, and in a fairly short amount of time. (He said, knowing full well that there's always one more problem.) ASCII versions of the functions should be almost trivial, and can be left in there as a compile-time switch should we choose to do an ASCII-only or UTF-8-only version. In conclusion, this approach feels more workable, and the full UTF-16 implementation details can be rolled out incrementally, rather than in a single mass migration. If this suggestion flies, I'll rewrite strings.pdd and post it in the next few days. -- Jeff [EMAIL PROTECTED]
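The proposed Parrot_isDigit() boils down to classifying by Unicode general category rather than by ASCII isdigit(). A rough Python equivalent of that behaviour (the function name is mine, not Parrot's):

```python
import unicodedata

def unicode_isdigit(ch: str) -> bool:
    # True for any character in general category Nd (decimal digit),
    # not just ASCII '0'-'9' -- the behaviour a Unicode-aware
    # isdigit() wrapper would need.
    return unicodedata.category(ch) == 'Nd'

print(unicode_isdigit('7'))        # ASCII digit
print(unicode_isdigit('\u0967'))   # DEVANAGARI DIGIT ONE
print(unicode_isdigit('A'))        # not a digit
```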
Re: Unicode thoughts...
Jeff wrote: Given that, I think an interim solution consisting of basic Unicode utilities we'll need, such as Unicode_isdigit(). [...] If this suggestion flies, I'll rewrite strings.pdd and post it in the next few days.

Okay, now I feel utterly silly, having just looked at chartypes/unicode.c. Well, that approach'll work. Wonder why nobody thought...greps for isdigit()...uh...never mind. I'll be over here, with the dunce cap on. -- Jeff [EMAIL PROTECTED]
RE: Unicode sorting...
I can't really believe that this would be a problem, but if they're integrated alphabets from different locales, will there be issues with sorting (if we're not planning to use the locale)? Are there instances where like characters were combined that will affect the sort orders?

Yes, it is an issue. In the general case, you CANNOT sort strings of several locales/languages into a single order that would satisfy all of the locales/languages. One often-quoted example is German and Swedish/Finnish: the LATIN CAPITAL LETTER A WITH RING ABOVE comes between A and B in the former but after Z (not immediately, but that doesn't matter here) in the latter. Similarly for all the accented alphabetic characters, the rules for how they are sorted differ from one place to another, and many languages have special combinations like ch, ss, ij that require special attention.

My understanding is that there is NO general Unicode sorting, period. The most useful one must be locale-sensitive, as defined by the Unicode collation. In practice, the story is even worse. For example, how do you sort strings coming from different locales? Say I have an address book with names from all over the world; which locale should I use to sort the names? Another example: Chinese has no definite sorting order, period. The commonly used schemes are phonetic-based or stroke-based. Since many characters have more than one pronunciation (context sensitive) and more than one form (simplified and traditional), if we have mixed content from China and Taiwan it is impossible to sort in a way everyone will feel happy. Also, Chinese is space insensitive. In English we have to use spaces to separate words, but in Chinese there are no lexical words, only linguistic words; you can insert a space between any two Chinese characters without changing their meaning. I heard a rumor a long time ago that the Unicode consortium was working on a locale-independent collation, which can be used to sort mixed content. 
As for Perl, I'd like to have several basic sortings:

a) binary sorting
b) locale-independent general sorting
c) locale-sensitive sorting based on the Unicode collation

We could have more if possible. The general sort can be done by canonicalizing all strings, removing case info, removing diacritics, removing font/width distinctions, then using binary sort. Hong
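The "locale-independent general sort" Hong describes (canonicalize, drop case, drop diacritics, then compare binary) can be sketched in a few lines of Python with the stdlib character database (function name and word list are mine):

```python
import unicodedata

def general_key(s: str) -> str:
    # Canonical decomposition plus case folding, then strip combining
    # marks: the normalization steps proposed for sort (b) above.
    d = unicodedata.normalize('NFD', s).casefold()
    return ''.join(c for c in d if not unicodedata.combining(c))

words = ['Zebra', 'éclair', 'apple', 'Ångström']
print(sorted(words, key=general_key))
# Accented and cased words interleave where a plain binary sort
# would push them apart.
```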
RE: Unicode sorting...
Another example: Chinese has no definite sorting order, period. The commonly used schemes are phonetic-based or stroke-based. Since many characters have more than one pronunciation (context sensitive) and more than one form (simplified and traditional), if we have mixed content from China and Taiwan it is impossible to sort in a way everyone will feel happy.

If this is the case, how would a regex like ^[a-zA-Z] work (or other, more sensitive characters)? If just about anything can come between A and Z, and letters that might be there in a particular locale aren't in another locale, then how will the regex engine make the distinction? Will it have to create its own locale-specific character table? Grant M. (is it just me, or is this looking more and more painful?)
RE: Unicode sorting...
If this is the case, how would a regex like ^[a-zA-Z] work (or other, more sensitive characters)? If just about anything can come between A and Z, and letters that might be there in a particular locale aren't in another locale, then how will the regex engine make the distinction?

This syntax was designed for English. It just does not make any sense in Chinese. The Chinese simply didn't have a sorting order for most of their history; the phonetic order and stroke order were introduced only a couple of hundred years ago. I don't really care how regex handles it. If I do need to search a range or sort, I will create my own collator to convert the string into a normalized form, and hand it to regex or qsort. It is up to me to define the collator. The regex does not even need to care about the order. Of course, the regex will support some basic ordering for optimization. Hong
Re: Unicode sorting...
If this is the case, how would a regex like ^[a-zA-Z] work (or other, more sensitive characters)? If just about anything can come between A and Z, and letters that might be there in a particular locale aren't in another locale, then how will regex engine make the distinction? This syntax was designed for English. It just does not make any sense in Chinese. It actually is rather faulty for English, too. -- $jhi++; # http://www.iki.fi/jhi/ # There is this special biologist word we use for 'stable'. # It is 'dead'. -- Jack Cohen
Re: Unicode sorting...
I can't really believe that this would be a problem, but if they're integrated alphabets from different locales, will there be issues with sorting (if we're not planning to use the locale)? Are there instances where like characters were combined that will affect the sort orders? Yes, it is an issue. In the general case, you CANNOT sort strings of several locales/languages into a single order that would satisfy all of the locales/languages. One often quoted example is German and Swedish/Finnish: the LATIN CAPITAL LETTER A WITH RING ABOVE comes between A and B in the former but after Z (not immediately, but doesn't matter here) in the latter. Similarly for all the accented alphabetic characters, the rules how they are sorted differ from one place to another , and many languages have special combinations like ch, ss, ij that require special attention. Unicode defines a canonical ordering which has hooks for locale specific rules: http://www.unicode.org/unicode/reports/tr10/ -- $jhi++; # http://www.iki.fi/jhi/ # There is this special biologist word we use for 'stable'. # It is 'dead'. -- Jack Cohen
RE: Unicode sorting...
At 11:29 AM 6/8/2001 -0700, Hong Zhang wrote: If this is the case, how would a regex like ^[a-zA-Z] work (or other, more sensitive characters)? If just about anything can come between A and Z, and letters that might be there in a particular locale aren't in another locale, then how will regex engine make the distinction? This syntax was designed for English. It just does not make any sense in Chinese. The Chinese just don't have sorting order for most of history. The phonetic order and stroke order was introduced only couple of hundred years ago. The A-Z syntax is really a shorthand for All the uppercase letters. (Originally at least) I won't argue the problems with sorting various sets of characters in various locales, but for regexes at least it's not an issue, because the point isn't sorting or ordering, it's identifying groups. We just need to make sure there's a named group for the different languages we know of--things like [[:kanji]] or [[:hiragana]] for example. (They should also be named in the language they represent, but I'm going to take a miss on trying to wedge an example in here, as I've a hard enough time getting letters with umlauts in) Dan --it's like this--- Dan Sugalski even samurai [EMAIL PROTECTED] have teddy bears and even teddy bears get drunk
Re: Unicode sorting...
The A-Z syntax is really a shorthand for All the uppercase letters. (Originally at least) I won't argue the problems with sorting various sets of characters in various locales, but for regexes at least it's not an issue, because the point isn't sorting or ordering, it's identifying groups. We just need to make sure there's a named group for the different languages we know of--things like [[:kanji]] or [[:hiragana]] for example.

It's spelled \p{...} (after I fixed a silly typo in bleadperl)

$ ./perl -Ilib -wle 'print "a" if "\x{30a1}" =~ /\p{InKatakana}/'
a
$ grep 30A1 lib/unicode/Unicode.txt
30A1;KATAKANA LETTER SMALL A;Lo;0;L;;;;;N;;;;;
3301;SQUARE ARUHUA;So;0;L;<square> 30A2 30EB 30D5 30A1;;;;N;SQUARED ARUHUA;;;;
3332;SQUARE HUARADDO;So;0;L;<square> 30D5 30A1 30E9 30C3 30C9;;;;N;SQUARED HUARADDO;;;;
FF67;HALFWIDTH KATAKANA LETTER SMALL A;Lo;0;L;<narrow> 30A1;;;;N;;;;;
$

-- $jhi++; # http://www.iki.fi/jhi/ # There is this special biologist word we use for 'stable'. # It is 'dead'. -- Jack Cohen
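Python's stdlib re module has no \p{...}, but the same membership test can be approximated from the character database; a sketch (mine, using the Katakana block range U+30A0..U+30FF as a crude stand-in for InKatakana):

```python
import unicodedata

ch = '\u30a1'   # KATAKANA LETTER SMALL A
# Name lookup confirms the character's identity...
print(unicodedata.name(ch))
# ...and a block-range check is a crude InKatakana property test.
print('\u30a0' <= ch <= '\u30ff')
```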
Re: Unicode handling
Dan Sugalski writes: : Fair enough. I think there are some cases where there's a base/combining : pair of codepoints that don't map to a single combined-character code : point. Not matching on a glyph boundary could make things really odd, but : I'd hate to have the checking code on by default, since that'd slow down : the common case where the string in NFC won't have those. Assume that in practice most of the normalization will be done by the input disciplines. Then we might have a pragma that says to try to enforce level 1, level 2, level 3 if your data doesn't match your expectations. Then hopefully the expected semantics of the operators will usually (I almost said "normally" :-) match the form of the data coming in, and forced conversions will be rare. That's how I see it currently. But the smarter I get the less I know. Larry
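The base/combining pair versus precomposed code point situation Dan and Larry are discussing is exactly what NFC/NFD normalization mediates. A Python illustration (my addition, not from the thread):

```python
import unicodedata

composed = '\u00e9'       # é as a single precomposed code point (NFC)
decomposed = 'e\u0301'    # 'e' + COMBINING ACUTE ACCENT (NFD)

# Binary-unequal strings, but canonically equivalent: normalizing
# either form maps between the two representations.
print(composed == decomposed)
print(unicodedata.normalize('NFC', decomposed) == composed)
print(unicodedata.normalize('NFD', composed) == decomposed)
```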
Re: Unicode handling
Garrett Goebel writes:
: Someone please clue me in. A pointer to an RFC which defines the use of
: colons in Perl6, among other things, would help.

Heh. If you read the RFCs, you'll discover one of the basic rules of language redesign: everybody wants the colon. And it never seems to occur to people that we'll actually have to break Perl 5's ?: operator in order to give them the colon. :-)

Larry
Re: Unicode handling
At 07:21 AM 3/27/2001 -0800, Larry Wall wrote:
>Assume that in practice most of the normalization will be done by the
>input disciplines. Then we might have a pragma that says to try to
>enforce level 1, level 2, or level 3 if your data doesn't match your
>expectations. Then hopefully the expected semantics of the operators
>will usually (I almost said "normally" :-) match the form of the data
>coming in, and forced conversions will be rare.

The only problem with that is that it means we'll potentially be altering the data as it comes in, which leads back to the problem of input and output files not matching for simple filter programs. (Plus it means we spend CPU cycles altering data that we might not actually need to.)

It might turn out that deferred conversions don't save anything, and if that's so then I can live with that. And we may feel comfortable declaring that we preserve equivalency in Unicode data only, and that's OK too. (Though *you* get to call that one... :)

					Dan

--------------------------------------"it's like this"-------------------
Dan Sugalski                          even samurai
[EMAIL PROTECTED]                     have teddy bears and even
                                      teddy bears get drunk
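Dan's filter-program concern is concrete: normalizing on input changes the bytes, so a pure pass-through no longer writes out what it read. A sketch in Python for convenience (the encodings, not the Perl I/O layer, are what matter here):

```python
import unicodedata

raw = 'e\u0301'  # decomposed é arrives on input: 3 bytes in UTF-8
normalized = unicodedata.normalize('NFC', raw)  # precomposed: 2 bytes

print(raw.encode('utf-8'))         # b'e\xcc\x81'
print(normalized.encode('utf-8'))  # b'\xc3\xa9'

# A "do nothing" filter that normalized on read would emit different
# bytes than it consumed, even though the text is equivalent:
print(raw.encode('utf-8') == normalized.encode('utf-8'))  # False
```

The two outputs are canonically equivalent text, so Larry's "preserve equivalency only" stance is coherent; it just gives up byte-for-byte fidelity.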
Re: Unicode handling
At 08:37 PM 3/26/2001 +, [EMAIL PROTECTED] wrote:
>Damien Neil <[EMAIL PROTECTED]> writes:
>>So $c = chr(ord($c)) could change $c? That seems odd.
>
>It changes its _representation_ (e.g. from 0x41,ASCII to 0xC1,EBCDIC) but
>not its "fundamental" 'LATIN CAPITAL LETTER A'-ness. Then of course
>someone will want it to be the number 0x41 and not do that 'cos they are
>using chr/ord to mess with JPEG image data... So there needs to be a
>'binary' encoding which they can use.
>
>>That doesn't seem to be what Dan was saying, however.
>
>And Dan is the one "in charge" on this list - so my perl5.7-ish view may
>be wrong.

"In charge" is such a strong phrase. (And not what I thought the job originally was, but that's a separate issue...)

>>It would make perfect sense to me for chr(ord($c)) to return $c in a
>>different encoding. (Assuming, of course, that $c is a single
>>character.) Assume ord is dependent on the current default encoding.
>>
>>    use utf8; # set default encoding.
>>    my $e : ebcdic = 'a';
>>    my $u = chr(ord($e));
>>
>>If ord is dependent on the current default encoding, I would expect the
>>above to leave the UTF-8 string "a" in $u. This makes sense to me.
>
>Good.

I'm afraid this isn't what I'd normally think of--ord to me returns the integer value of the first code point in the string. That does mean that A is different for ASCII and EBCDIC, but that's just One Of Those Things. The alternative is for us to do data conversions some times (when we're pulling data out of an EBCDIC or Shift-JIS string in a Unicode block) but not others (when we're pulling binary data out in a Unicode or EBCDIC block). That seems a little off to me, but I could well be wrong. It also means we may well mangle data that's incorrectly tagged--if, for example, an input filter tagged binary data with a non-binary type, which isn't that unlikely.
>>If ord is dependent on the encoding of the string it gets, as Dan was
>>saying, then ord($e) is 0x81,
>
>It could still be 0x81 (from ebcdic) with the encoding carried along with
>the _number_, if we thought that worth the trouble. (It isn't too bad for
>assignment, but it is far from clear what 2 (ebcdic) * 0xA1 (iso_8859_7)
>might mean - perhaps we drop the tag if anything other than + or -
>happens.) Or what we do with it if it's stringified.

The only thing I can see keeping the tag around for would be later chr() and pack() calls, and that doesn't seem like it'd happen often enough to justify the overhead. Could be wrong, though.

>>and $u is "\x81". This seems strange.
>>
>>Hmm. It suddenly occurs to me that I may have been misinterpreting: ord
>>is dependent on both the encoding of its argument (to determine the
>>logical character contained in that argument) and the current default
>>encoding (to determine the value in the current character set
>>representing that character).

That wasn't my intention. I was thinking that chr was bound to the current default encoding, and ord was bound to the string type of the scalar being ord-ed.

					Dan

--------------------------------------"it's like this"-------------------
Dan Sugalski                          even samurai
[EMAIL PROTECTED]                     have teddy bears and even
                                      teddy bears get drunk
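The 0x81 in the thread is the EBCDIC byte for lowercase 'a', which is why the answer to "what does ord return?" depends on whether ord sees the stored representation or the logical character. A sketch in Python for convenience, using its stdlib EBCDIC codec cp037 (the Perl `:ebcdic` attribute above was only a proposal):

```python
# Unicode/ASCII: 'a' is code point 0x61.
print(hex(ord('a')))                  # 0x61

# The same logical character stored in EBCDIC (code page 037) is the
# byte 0x81 -- this is the ord($e) == 0x81 case from the thread:
print(hex('a'.encode('cp037')[0]))    # 0x81

# Decoding that byte back recovers the same 'a'-ness, the
# "representation changed, character didn't" point:
print(bytes([0x81]).decode('cp037'))  # a
```

So "ord bound to the string's own encoding" yields 0x81 for an EBCDIC 'a', while "ord bound to the logical character in the default encoding" yields 0x61; the code above shows both numbers name one character.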