Re: Stacks, registers, and bytecode. (Oh, my!)
Larry Wall [EMAIL PROTECTED] wrote: It may certainly be valuable to (not) think of it that way, but just don't be surprised if the regex folks come along and borrow a lot of your opcodes to make things that look like (in C):

    while (s < send && isdigit(*s)) s++;

This is the bit that scares me about unifying perl ops and regex ops: I see perl ops as relatively heavyweight things that can absorb the costs of 'heavyweight' dispatch (function call overhead, etc etc), while regex stuff needs to be very lightweight, eg

    while (op = *optr++) {
        switch (op) {
        case FOO: while (s < send && isdigit(*s)) s++; break;
        case BAR: while (s < send && isspace(*s)) s++; break;
        }
    }

Can we really unify them without taking a performance hit?
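For contrast with the lightweight switch above, a minimal sketch of function-pointer dispatch, the 'heavyweight' kind that perl-level ops can absorb -- all names here are hypothetical, not actual perl internals:

    #include <stddef.h>

    typedef struct interp interp;
    typedef void (*opfunc)(interp *);

    struct interp {
        opfunc *pc;   /* stream of op function pointers; NULL terminates */
    };

    void run(interp *i) {
        while (*i->pc != NULL)
            (*i->pc++)(i);   /* one indirect call per op: the dispatch cost
                                that a tight regex loop can't afford */
    }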
Re: Stacks, registers, and bytecode. (Oh, my!)
On Tue, Jun 05, 2001 at 11:25:09AM +0100, Dave Mitchell wrote: This is the bit that scares me about unifying perl ops and regex ops: can we really unify them without taking a performance hit? Coupl'a things: firstly, we can make Perl 6 ops as lightweight as we like. Second, Ruby uses a giant switch instead of function pointers for their op despatch loop; Matz says it doesn't make that much difference in terms of performance. I don't know if I've mentioned this before, but http://www-6.ibm.com/jp/developerworks/linux/001027/ruby_qa.html was my interview with Matsumoto about his ideas for Perl 6 and his experiences from Ruby. It's in Japanese, so http://www.excite.co.jp/world/url/ may help. -- Familiarity breeds facility. -- Megahal (trained on asr), 1998-11-06
Re: Stacks, registers, and bytecode. (Oh, my!)
Simon Cozens [EMAIL PROTECTED] opined: On Tue, Jun 05, 2001 at 11:25:09AM +0100, Dave Mitchell wrote: This is the bit that scares me about unifying perl ops and regex ops: can we really unify them without taking a performance hit? Coupl'a things: firstly, we can make Perl 6 ops as lightweight as we like. Second, Ruby uses a giant switch instead of function pointers for their op despatch loop; Matz says it doesn't make that much difference in terms of performance.

I think it would be very messy to have both types of ops in the same dispatch loop. I'd much rather have a 'regex start' opcode which calls a separate dispatch loop function, and which then interprets any further ops in the bytestream as regex ops. That way we double the number of 8-bit ops, and can have all the regex-specific state variables (s, send etc in the earlier example) and logic separated out.

I don't know if I've mentioned this before, but http://www-6.ibm.com/jp/developerworks/linux/001027/ruby_qa.html was my interview with Matsumoto about his ideas for Perl 6 and his experiences from Ruby. It's in Japanese, so http://www.excite.co.jp/world/url/ may help.

A talk of jewelry Perl developer From Mr. Simon Cozens to Ruby developer It is also as a pine It dies and is the question and reply to Mr. [squiggle] :-)
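A minimal sketch of that two-loop arrangement, with the nested loop owning the regex-only state -- op names and numbering invented purely for illustration:

    #include <ctype.h>

    enum { REGEX_END, RX_DIGITS, RX_SPACES };   /* hypothetical regex ops */

    /* The main loop hits a 'regex start' op and hands the rest of the
     * bytestream here; s and send never leak into the main loop. */
    unsigned char *run_regex(unsigned char *optr, char *s, char *send) {
        int op;
        while ((op = *optr++) != REGEX_END) {
            switch (op) {
            case RX_DIGITS: while (s < send && isdigit((unsigned char)*s)) s++; break;
            case RX_SPACES: while (s < send && isspace((unsigned char)*s)) s++; break;
            }
        }
        return optr;   /* main dispatch resumes with its own 8-bit op table */
    }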
Re: PDD 2nd go: Conventions and Guidelines for Perl Source Code
On Tue, 5 Jun 2001, Hugo wrote: I'd also like to see a specification for indentation when breaking long lines. Fwiw, the style that I prefer is:

    someFunc( really_long_param_1,
              (long_parm2 || parm3),
              really_long_other_param
    );

or, for really complex expressions:

    (    really_long_param_1
      && (parm1 || long_parm1)
      && (    yet_another_long_param
           && parm2
           && (long_parm2 || parm3)
         )
    );

Putting the final close paren on the next line makes it easier to tell where the (sub)expression finishes. Dave
Re: Stacks, registers, and bytecode. (Oh, my!)
On Tue, 5 Jun 2001, Dave Mitchell wrote: dispatch loop. I'd much rather have a 'regex start' opcode which calls a separate dispatch loop function, and which then interprets any further ops in the bytestream as regex ops. That way we double the number of 8-bit ops, and can have all the regex-specific state variables (s, send etc in the earlier example) and logic separated out.

This is an interesting idea... could we use this more generally to multiply our number of opcodes? Basically, you have one set of opcodes for (e.g.) string parsing, one set for math, etc, all of which share the same numeric values. Then you have a set of opcodes that tells the interpreter which opcode table to look in. The 'switching' opcodes then become overhead, but if there aren't too many of those, perhaps it's acceptable. And it would mean that we could specialize the opcodes a great deal more (if, of course, that is desirable), and still have them fit in an octet. (Sorry if this is a stupid question, but please be patient; I've never done internals stuff before.) Dave
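A speculative sketch of that table-switching idea, with made-up table names and switching-op values:

    typedef void (*opfunc)(void);

    /* Hypothetical 256-entry tables, one per opcode family,
     * filled in elsewhere. */
    extern opfunc math_ops[256], string_ops[256], regex_ops[256];

    void run(const unsigned char *pc) {
        opfunc *table = math_ops;                 /* default family */
        for (;;) {
            unsigned char op = *pc++;
            switch (op) {
            case 0x00: return;                    /* HALT */
            case 0xFD: table = math_ops;   break; /* the 'switching' ops: */
            case 0xFE: table = string_ops; break; /* pure overhead, but */
            case 0xFF: table = regex_ops;  break; /* cheap if they're rare */
            default:   table[op]();        break; /* one octet, many meanings */
            }
        }
    }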
Should we care much about this Unicode-ish criticism?
Courtesy of Slashdot, http://www.hastingsresearch.com/net/04-unicode-limitations.shtml I'm not sure if this is an issue for us or not, as we're generally language-neutral, and I don't see any technical issues with any of the UTF-* encodings having headroom problems. It does argue for abstracting out the string handling code a bit so it can be replaced without completely rebuilding perl, but I'm not sure that it's that strong an argument. (Though it would be nice to upgrade perl from Unicode 3.1 to 3.2 with the equivalent of a module upgrade rather than a full rebuild) Dan --it's like this--- Dan Sugalski even samurai [EMAIL PROTECTED] have teddy bears and even teddy bears get drunk
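A speculative sketch of the abstraction Dan suggests -- all names hypothetical -- routing character-set knowledge through one replaceable table, so a Unicode 3.1 to 3.2 update swaps a module instead of rebuilding perl:

    #include <stddef.h>

    typedef struct {
        const char *name;                            /* e.g. "Unicode 3.1" */
        size_t (*char_len)(const unsigned char *p);  /* bytes in char at p */
        int    (*to_upper)(int c);
        int    (*is_alpha)(int c);
    } charset_ops;

    extern charset_ops unicode_3_1;                  /* from a loadable module */
    charset_ops *current_charset = &unicode_3_1;     /* swap to upgrade */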
Re: PDD 2nd go: Conventions and Guidelines for Perl Source Code
On Tue, 29 May 2001 18:25:45 +0100 (BST), Dave Mitchell wrote: diffs:

    -KR style for indenting control constructs
    +KR style for indenting control constructs: ie the closing C<}> should
    +line up with the opening C<if> etc.

On Wed, 30 May 2001 10:37:06 -0400, Dan Sugalski wrote: I realize that no matter what style we choose, there will be a good crop of people who won't be thrilled with it. (For the record, you can count me as one, if that makes anyone feel any better :)

That's inevitable. If you have a diff/patching suite that falls over whitespace, you have a problem with diff, not with style. One can always do a pretty-print cleanup of the code before doing the diff, if all else fails. IMO this is not worth bickering over. -- Bart.
Re: Should we care much about this Unicode-ish criticism?
At 06:22 PM 6/5/2001 +0100, Simon Cozens wrote: On Tue, Jun 05, 2001 at 10:17:08AM -0700, Russ Allbery wrote: Is it just me, or does this entire article reduce not to Unicode doesn't work but Unicode should assign more characters? Yes. And Unicode has assigned more characters; it's factually challenged.

The other issue it actively brought up was the complaint about having to share glyphs amongst several languages, which didn't strike me as all that big a deal either, except perhaps as a matter of national pride and/or easy identification of the language of origin for a glyph. Not being literate in any of the languages in question, though, I didn't feel particularly qualified to make a judgement as to the validity of the complaints.

It does bring up a deeper issue, however. Unicode is, at the moment, apparently inadequate to represent at least some part of the Asian languages. Are the encodings currently in use less inadequate? I've been assuming that an Anything-Unicode translation will be lossless, but this makes me wonder whether that assumption is correct. Dan --it's like this--- Dan Sugalski even samurai [EMAIL PROTECTED] have teddy bears and even teddy bears get drunk
RE: Stacks, registers, and bytecode. (Oh, my!)
On Tue, Jun 05, 2001 at 11:25:09AM +0100, Dave Mitchell wrote: This is the bit that scares me about unifying perl ops and regex ops: can we really unify them without taking a performance hit? Coupl'a things: firstly, we can make Perl 6 ops as lightweight as we like. Second, Ruby uses a giant switch instead of function pointers for their op despatch loop; Matz says it doesn't make that much difference in terms of performance.

Function pointer dispatch is normally as fast as, or faster than, a switch. The main downside is the context. A typical regular expression engine can pre-fetch many variables into register locals, which can be used efficiently by all the switch cases. However, the common context for regular expressions is relatively small, so I am not sure about the performance hit. Hong
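Hong's point, sketched with hypothetical op numbers: inside one big switch the compiler can keep s and send in registers across every case, while function-pointer ops must carry the same state through a context structure in memory:

    #include <ctype.h>

    /* One big switch: s and send can stay in registers throughout. */
    void match_switch(const unsigned char *ops, char *s, char *send) {
        int op;
        while ((op = *ops++) != 0)
            switch (op) {
            case 1: while (s < send && isdigit((unsigned char)*s)) s++; break;
            case 2: while (s < send && isspace((unsigned char)*s)) s++; break;
            }
    }

    /* Function pointers: each op loads and stores through the context. */
    typedef struct { char *s, *send; } rx_ctx;

    void op_digits(rx_ctx *c) {
        while (c->s < c->send && isdigit((unsigned char)*c->s)) c->s++;
    }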
Re: Should we care much about this Unicode-ish criticism?
On Tue, Jun 05, 2001 at 01:31:38PM -0400, Dan Sugalski wrote: The other issue it actively brought up was the complaint about having to share glyphs amongst several languages, which didn't strike me as all that big a deal either, except perhaps as a matter of national pride and/or easy identification of the language of origin for a glyph. Not being literate in any of the languages in question, though, I didn't feel particularly qualified to make a judgement as to the validity of the complaints.

There are a number of related problems here; the Han unification effort has pissed off some Asians on several counts. The easiest part to explain is display; this isn't something that Perl particularly needs to care about, but the same glyph may need to look different if it's in Chinese rather than in Japanese. For the rest, I refer the assembly to my undergraduate dissertation :) :

Unicode itself is, like the JIS standard, simply an enumeration of characters with their orderings; it says nothing about how the data is represented to the computer, and must be supplemented by one of several Unicode Transformation Formats which describe the encoding. However, despite the huge benefits to programmers worldwide, two critical problems are hindering the adoption of Unicode amongst the Japanese computer-using community. The first objection is technical, and the second is more sociological.

The technical objection stems from the fact that the Unicode Consortium initially assigned a finite space for all Japanese, Chinese and Korean characters, allowing only just under 28,000 characters. This space has nearly been filled, with 20,902 basic characters already accepted, and 6,585 new characters under review; the situation is not going to get any better as Chinese characters are invented for use in proper names and so on. It is evident that 28,000 characters is not going to be anywhere near enough, and programmers have felt betrayed that the promise of a `fully Universal character set' will satisfy all other languages but theirs. Thankfully, the Unicode Consortium has recently assigned another extension plane for CJK characters and adopted a further 42,711 characters, meaning that all the characters in the Chinese Han Yu Da Zidian and the Japanese Morohashi Dai Kanwa Jiten are now adopted into Unicode. However, many programmers are unaware of the extension plane and still feel that the Unicode Consortium is ignoring their plight.

More serious, however, is the decision to unify equivalent characters in the Chinese, Japanese and Korean character sets into a single table known as `Unihan'[10]. This has proved controversial primarily through lack of understanding of the nature of `equivalent characters': the Unihan table does not constitute a dumbing down of the character set, as simplified and traditional forms of characters have been maintained. However, Chinese and Japanese variants of the same single character have been unified. The Unicode standard seeks to encode characters rather than glyphs[11], and hence the variant characters which come about due to variations in writing style have been unified. On the other hand, characters undergoing structural variance have not been unified. The principles on which Han Unification took place are, according to [Graham, 2000], not dissimilar to those used to unify characters in the legacy JIS and other character sets.
Three rules were used to determine whether or not two kanji should be considered equivalent:

Source Separation Rule: If two kanji were distinct in a primary source character set (JIS in the case of Japanese, GB2312-80 and other GB standards for Chinese, KSC5601-1987 for Korean, and so on) then they should not be unified. This would allow round-trip conversion between Unicode and the original source. For instance, the following variants of the character for tsurugi, sword, were not unified: [Picture omitted]

Non-Cognate Rule: Kanji which are not cognate are not variants; this prohibits, for instance, the unification of the following characters: [Picture omitted]

Component Structure: If a unification is acceptable under the above rules, unification is only carried out if the characters share the same radicals and component features, taking into consideration their arrangement.

Using these rules, the CJK Joint Research Group of the ISO technical committee on Unicode reduced a candidate set of 121,000 Han characters to 20,902 unique characters [12]. On the other hand, there are some valid objections from the Japanese, on three specific counts [13]:

Firstly, the JIS standard defines, along with the ordering and enumeration of its characters, their glyph shape. Unicode, on the other hand, does not. This means that as far as Unicode is concerned, there is literally no distinction between two distinct shapes and hence no way to specify which should be used. This becomes particularly emotive when one is, for instance, attempting to
RE: Should we care much about this Unicode-ish criticism?
Courtesy of Slashdot, http://www.hastingsresearch.com/net/04-unicode-limitations.shtml I'm not sure if this is an issue for us or not, as we're generally language-neutral, and I don't see any technical issues with any of the UTF-* encodings having headroom problems.

I think the author confused himself. Unicode itself is not sufficient to process human language, no matter how many characters it includes. It is just an encoding. Just take Chinese as an example: only a small percentage (10%) of Chinese can read more than 6000 characters. The biggest dictionary I know of includes about 65000 characters, many of which even linguists cannot agree on. Some of the characters are essentially the research results of the dictionary's authors. It is impossible to include those characters in an international standard such as Unicode.

Unicode contains surrogates for future growth. We still have about 1M code points left for allocation. Eventually it will include many more characters than anyone can care about. Hong
Re: Stacks, registers, and bytecode. (Oh, my!)
On Mon, Jun 04, 2001 at 06:04:10PM -0700, Larry Wall wrote: Well, other languages have explored that option, and I think that makes for an unnatural interface. If you think of regexes as part of a larger language, you really want them to be as incestuous as possible, just as any other part of the language is incestuous with the rest of the language. That's part of what I mean when I say that I'm trying to look at regular expressions as just a strange variant of Perl code. Looking at it from a slightly different angle, regular expressions are in great part control syntax, and library interfaces are lousy at implementing control. Right. Having the regex opcodes be perl opcodes will certainly make implementing (?{ ... }) much easier and probably faster too. Also re references that we have now will become similar to subroutines for pattern matching. I think there are a lot of benefits to the re engine not to be separate from the core perl ops. Graham.
RE: Should we care much about this Unicode-ish criticism?
Firstly, the JIS standard defines, along with the ordering and enumeration of its characters, their glyph shape. Unicode, on the other hand, does not. This means that as far as Unicode is concerned, there is literally no distinction between two distinct shapes and hence no way to specify which should be used. This becomes particularly emotive when one is, for instance, attempting to represent a person's name - if they have a particular preferred variant character with which they write their name, there is no way to communicate that to the computer, and information is lost.

This is very common, nothing surprising. As you can tell, my name is "hong zhang", which has already lost the "Chinese tone" and "glyph". "hong" has 4 tones, each tone can be any of several characters, and each character can be one of several glyphs (simplified and traditional). However, it does not really matter; it is still my name.

The second objection is again related to character versus glyph issues: since Chinese,

I think this problem =~ locale. For any Unicode character, you cannot properly tell its lowercase or uppercase form without considering locale. And Unicode does not encode locale.

Finally, there is a historiographical issue; when computers are used to digitise and store historical literature containing archaic characters, specifying the exact variant character becomes an important consideration.

I believe this should be handled by the application. This kind of work is needed for research; Perl should not care about it. Hong
Re: Should we care much about this Unicode-ish criticism?
On 05 Jun 2001 11:07:11 -0700, Russ Allbery wrote: Particularly since part of his contention is that 16 bits isn't enough, and I think all the widely used national character sets are no more than 16 bits, aren't they?

It's not really important. UTF-8 is NOT limited to 16 bits (3 bytes). With 4 bytes, UTF-8 can represent 20-bit characters, i.e. 6 times more than the desired number of 17. See http://czyborra.com/utf/#UTF-8 for how this is done.

And the major flaw that I see in the acceptance of Unicode is that Unicode text files are not ASCII compatible. UTF-8 files are. That makes for a very nice upgrade path. -- Bart.
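For reference, a sketch of the classic UTF-8 encoder, truncated at four bytes (RFC 2279 defines forms up to six); four bytes carry 21 payload bits, enough for the full 17-plane range:

    /* Encode one code point as UTF-8; returns the byte count. */
    int utf8_encode(unsigned long c, unsigned char *buf) {
        if (c < 0x80)  { buf[0] = (unsigned char)c; return 1; }
        if (c < 0x800) { buf[0] = (unsigned char)(0xC0 | (c >> 6));
                         buf[1] = (unsigned char)(0x80 | (c & 0x3F)); return 2; }
        if (c < 0x10000) {
            buf[0] = (unsigned char)(0xE0 | (c >> 12));
            buf[1] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
            buf[2] = (unsigned char)(0x80 | (c & 0x3F)); return 3;
        }
        /* Four bytes reach 0x1FFFFF, comfortably past 0x10FFFF. */
        buf[0] = (unsigned char)(0xF0 | (c >> 18));
        buf[1] = (unsigned char)(0x80 | ((c >> 12) & 0x3F));
        buf[2] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
        buf[3] = (unsigned char)(0x80 | (c & 0x3F));
        return 4;
    }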
Re: Should we care much about this Unicode-ish criticism?
On Tue, Jun 05, 2001 at 09:16:05PM +0200, Bart Lateur wrote: Unicode text files No such animal. Unicode's a character repertoire, not an encoding. See you at my Unicode tutorial at TPC? :) -- buf[hdr[0]] = 0;/* unbelievably lazy ken (twit) */ - Andrew Hume
RE: Should we care much about this Unicode-ish criticism?
At 11:18 AM 6/5/2001 -0700, Hong Zhang wrote: Firstly, the JIS standard defines, along with the ordering and enumeration of its characters, their glyph shape. Unicode, on the other hand, does not. This means that as far as Unicode is concerned, there is literally no distinction between two distinct shapes and hence no way to specify which should be used. This becomes particularly emotive when one is, for instance, attempting to represent a person's name - if they have a particular preferred variant character with which they write their name, there is no way to communicate that to the computer, and information is lost. This is very common, nothing surprising. As you can tell, my name is hong zhang, which has already lost the Chinese tone and glyph. hong has 4 tones, each tone can be any of several characters, each character can be one of several glyphs (simplified and traditional). However, it does not really matter; it is still my name.

I dunno. It's one thing to have a word represented with non-native characters--loss is expected. It's quite another to have it spelled out in an encoding that's supposed to preserve such things and have it not actually do that. That'd be like having my name spelled or pronounced differently because it was encoded in Unicode instead of ASCII. That's just plain wrong.

The second objection is again related to character versus glyph issues: since Chinese, I think this problem =~ locale. For any Unicode character, you cannot properly tell its lowercase or uppercase form without considering locale. And Unicode does not encode locale.

Yeah, that is a problem. The alternative isn't any better, unfortunately. Human languages are a pain. :) We're going to need case-translation stuff for perl 6, I think, if lc, uc, and their ilk are going to work properly.

Finally, there is a historiographical issue; when computers are used to digitise and store historical literature containing archaic characters, specifying the exact variant character becomes an important consideration. I believe this should be handled by the application. This kind of work is needed for research; Perl should not care about it.

I think I'd agree there. Different versions of a glyph are more a matter of art and handwriting styles, and that's not really something we ought to get involved in. The European equivalent would be to have many versions of A, so we could represent the different ways it was drawn in various illuminated manuscripts. That seems rather excessive. Dan --it's like this--- Dan Sugalski even samurai [EMAIL PROTECTED] have teddy bears and even teddy bears get drunk
Re: Should we care much about this Unicode-ish criticism?
At 12:40 PM 6/5/2001 -0700, Russ Allbery wrote: Bart Lateur [EMAIL PROTECTED] writes: UTF-8 is NOT limited to 16 bits (3 bytes). That's an odd definition of byte you have there. :)

Maybe it's RAD50. :) Still, it may take 3 bytes to represent in UTF-8 a character that takes 2 bytes in UTF-16.

With 4 bytes, UTF-8 can represent 20-bit characters, i.e. 6 times more than the desired number of 17. UTF-8 is a mapping from a 31-bit (yes, not 32, interestingly enough) character numbering, and as such can represent over two billion characters. For some reason that I've never understood, the Unicode folks are limiting that to only a subset of what one can do with 31 bits by putting an artificial limit on how high a character value they're willing to assign, but even with that, as soon as they started using the higher planes, there's easily enough space to add every character the author mentioned and then some.

Yeah, the limitations are kind of odd. I'm presuming they're in there so the technical folks have at least some sort of a stick to smack the crankier non-technical folks with.

(As an aside, UTF-8 also is not an X-byte encoding; UTF-8 is a variable byte encoding, with each character taking up anywhere from one to six bytes in the encoded form depending on where in Unicode the character falls.)

Have they changed that again? Last I checked, UTF-8 was capped at 4 bytes, but that's in the Unicode 3.0 standard. Dan --it's like this--- Dan Sugalski even samurai [EMAIL PROTECTED] have teddy bears and even teddy bears get drunk
Re: Stacks, registers, and bytecode. (Oh, my!)
Graham Barr wrote: I think there are a lot of benefits to the re engine not to be separate from the core perl ops. So does it start with a split(//,$bound_thing) or does it use substr(...) with explicit offsets?
Re: Stacks, registers, and bytecode. (Oh, my!)
On Tue, Jun 05, 2001 at 03:31:24PM -0500, David L. Nicol wrote: Graham Barr wrote: I think there are a lot of benefits to the re engine not to be separate from the core perl ops. So does it start with a split(//,$bound_thing) or does it use substr(...) with explicit offsets? Eh? Nobody is suggesting we implement re's using the current set of perl ops, but that we extend the set with ops needed for re's, so that they use the same dispatch loop and the ops can be intermixed. Graham.
Re: Should we care much about this Unicode-ish criticism?
On Tuesday 05 June 2001 03:24 pm, Dan Sugalski wrote: The second objection is again related to character versus glyph issues: since Chinese, I think this problem =~ locale. For any Unicode character, you cannot properly tell its lowercase or uppercase form without considering locale. And Unicode does not encode locale. Yeah, that is a problem. The alternative isn't any better, unfortunately. Human languages are a pain. :) We're going to need case-translation stuff for perl 6, I think, if lc, uc, and their ilk are going to work properly.

Yes, we've discussed this off and on for various things - character class identification, sorting, comparison, case-translation. Where do you draw the line, lines, and/or default line? I'd like Perl to be able to handle textual information, and not just do character manipulation, but that doesn't mean at the core level.

Some additional stuff to ponder over, and maybe Unicode addresses these - I haven't been able to read *all* the Unicode stuff yet. (And, yes, Simon, you will see me in class.) Some languages don't have upper or lower case. Are tests and translations on caseless characters true or false? (Or undefined?) Should the same Unicode character, when used in two different languages, be string equivalent?

Asciibetical order is one thing, as it (roughly) maps alphabetical order for English. But unless you've been blessed with a root language for Unicode mapping (such as Arabic), Unicodical sorting is going to be non-sensical, as you hop between your language variants and the characters encoded somewhere else (as in Farsi). And, of course, there are several different orderings for eastern glyph languages, IINM.

But I think it'd be too heavy to make Perl inherently locale-aware. The best, I think, would be to have Perl simply be Unicode neutral - to treat the characters (with any equivalencies, etc) as just data - and to allow locale modules to replace or supplement the ops/functions/* that *are* locale aware. That would allow all the locale-specific handling code to be written/debugged/distributed separately from the core on its own timeframe. It would ultimately lead to a little more consistency, since everyone can use a common handler instead of rolling their own. No need to have locale handlers for locales you won't use.

Of course, being Unicode neutral, that still leaves some stuff (like case determination) undefined. So maybe there should be a default locale in place - the current, or barring that, English, I suppose. -- Bryan C. Warnock [EMAIL PROTECTED]
Re: Should we care much about this Unicode-ish criticism?
On Tue, Jun 05, 2001 at 05:39:36PM -0400, Bryan C . Warnock wrote: Some languages don't have upper or lower case. Are tests and translations on caseless characters true or false? (Or undefined?) I'd say undefined.

Should the same Unicode character, when used in two different languages, be string equivalent? YES. Definitely. Same Unicode character, same thing. You wanted something else, use a different Unicode character.

Asciibetical order is one thing, as it (roughly) maps alphabetical order for English. But unless you've been blessed with a root language for Unicode mapping (such as Arabic), Unicodical sorting is going to be non-sensical, as you hop between your language variants and the characters encoded somewhere else (as in Farsi). And, of course, there are several different orderings for eastern glyph languages, IINM. Not our problem. There are collation sequences within the various subsets, and these'll work fine if we go by UTR#10. If you ask for a non-sensical comparison between two different languages, you'll get one.

But I think it'd be too heavy to make Perl inherently locale-aware. The best, I think, would be to have Perl simply be Unicode neutral - to treat the characters (with any equivalencies, etc) as just data Strongly agree.

That would allow all the locale-specific handling code to be written/debugged/distributed separately from the core on its own timeframe. Strongly agree.

Of course, being Unicode neutral, that still leaves some stuff (like case determination) undefined. So maybe there should be a default locale in place - the current, or barring that, English, I suppose. Default to ASCII-ish and make it very, very easy for locale handling modules to override the various pieces of the puzzle. -- It can be hard to tell an English bigot from a monoglot with an inferiority complex, but one cannot tell a Welshman any thing a tall. - Geraint Jones.
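A minimal sketch of the override mechanism Simon suggests -- hypothetical names, ASCII-ish defaults that a locale module can replace:

    #include <ctype.h>
    #include <string.h>

    struct locale_hooks {
        int (*to_upper)(int c);                        /* case translation */
        int (*compare)(const char *a, const char *b);  /* collation */
    };

    /* ASCII-ish defaults from the C library. */
    static struct locale_hooks hooks = { toupper, strcmp };

    /* A locale module calls this to install its own versions. */
    void set_locale_hooks(const struct locale_hooks *h) {
        hooks = *h;
    }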
Re: Should we care much about this Unicode-ish criticism?
Dan Sugalski [EMAIL PROTECTED] writes: At 12:40 PM 6/5/2001 -0700, Russ Allbery wrote: (As an aside, UTF-8 also is not an X-byte encoding; UTF-8 is a variable byte encoding, with each character taking up anywhere from one to six bytes in the encoded form depending on where in Unicode the character falls.) Have they changed that again? Last I checked, UTF-8 was capped at 4 bytes, but that's in the Unicode 3.0 standard. Yes, it changed with Unicode 3.1 when they started allocating characters from higher planes. Far and away the best reference for UTF-8 that I've found is RFC 2279. It's much more concise and readable than the version in the Unicode standard, and is more aimed at implementors and practical considerations. -- Russ Allbery ([EMAIL PROTECTED]) http://www.eyrie.org/~eagle/
Re: Should we care much about this Unicode-ish criticism?
Bryan C Warnock [EMAIL PROTECTED] writes: Some additional stuff to ponder over, and maybe Unicode addresses these - I haven't been able to read *all* the Unicode stuff yet. (And, yes, Simon, you will see me in class.) Some languages don't have upper or lower case. Are tests and translations on caseless characters true or false? (Or undefined?) Caseless characters should be guaranteed unchanged by conversion to upper or lower case, IMO. Case is a normative property of characters in Unicode, so case mappings should actually be pretty well-defined. Note that there are actually three cases in Unicode, upper, lower, and title case, since there are some characters that require the third distinction (stuff like Dz is generally used as an example). Should the same Unicode character, when used in two different languages, be string equivalent? The way to start solving this whole problem is probably through normalization; Unicode defines two separate normalizations, one of which collapses more similar characters than the other. One is designed to preserve formatting information while the other loses formatting information. (The best example of how they differ is that one leaves the ffi ligature alone and the other breaks it down into three separate characters.) Perl should allow programmers to choose their preferred normalization schemes or none at all. (There are really four normalization schemes; in two of them, you leave things fully decomposed, and in the other two you recompose characters as much as possible.) -- Russ Allbery ([EMAIL PROTECTED]) http://www.eyrie.org/~eagle/
Re: Should we care much about this Unicode-ish criticism?
On Tuesday 05 June 2001 05:49 pm, Simon Cozens wrote: YES. Definitely. Same Unicode character, same thing. You wanted something else, use a different Unicode character.

I don't understand. There *is* only one character. I can't choose another. Take 0x0648, for instance. It's both waw, the 27th letter of the Arabic alphabet, and veh, the 30th letter of the Persian alphabet, which aren't the same letter. Same character, different letters. Equivalent, or different? In Unicode, or locale-independent terms, they're the same; I've no problem with that. Within one locale or the other, I'm not so sure. I think it needs to be able to go both ways, with equivalence perhaps being the default.

(Perhaps this need only be so simple as to be able to tag and query (via attributes, for instance) the language of the string, and handle the logic yourself. If the languages differ, no sense in comparing, yadda yadda yadda. Then again, whether it is a difference or not may also be a language issue. I'd be inclined to think that waw and veh are different, but Gift (in English) and Gift (in German) are the same. To me, those are the same characters and same letters (even though, I guess, technically they are not), with just different meanings.)

In either case (or perhaps it is an extension of the same case), each locale should be able to specify and handle its own determination of equivalency. As I watch everyone talk about the eastern languages, and I think of the middle eastern languages, I realize what a mess this potentially is. For the most part, Hong is right - it's for the applications to handle. But I think that we need to have a clear understanding of what we're asking the applications to handle, in an effort to make the hard things easy. (And for some of these languages, it can be quite hard.) -- Bryan C. Warnock [EMAIL PROTECTED]
Re: Should we care much about this Unicode-ish criticism?
Simon Cozens [EMAIL PROTECTED] writes: On Tue, Jun 05, 2001 at 03:27:03PM -0700, Russ Allbery wrote: Caseless characters should be guaranteed unchanged by conversion to upper or lower case, IMO. I think Bryan's asking more about \p{IsUpper} than uc(). Ahh... well, Unicode classifies them for us, yes? Lowercase, Uppercase, Titlecase, and Other, IIRC. So a caseless character wouldn't show up in either IsLower or IsUpper. -- Russ Allbery ([EMAIL PROTECTED]) http://www.eyrie.org/~eagle/
RE: Should we care much about this Unicode-ish criticism?
The problem, as I see it, is not that the mechanism can't handle the languages; it is that the Latin/Gothic countries chose first, and gave what's left to the Oriental countries. This is evident in the Musical Symbols and even Byzantine Musical Symbols. Are these character sets more important than the actual language character sets being denied to the other countries? Are musical and mathematical symbols even a language at all? Yes, I understand that they are in the sense that they convey information, but if Unicode is only trying to generically represent common-use language, then some of the characters (perhaps sets) should go. And if we go the other way and say that this is intended to represent every sort of written, spoken, or symbolic communication, then it really opens up the floodgates (I need a character for the men's room sign, please).

Here are some questions for English speakers to ask themselves about Unicode: Are the original ASCII graphical characters somehow more worthy of inclusion than the Chinese characters? Aren't Unicode 0xBD (the one-half character) and 1/2 the same? When was the last time that you saw the cent sign on a computer? When was the last time that you saw the cent sign anywhere?

It seems to me that Unicode, in its present form, although a valiant attempt, is just a 'better' ASCII, and not a complete solution. Grant M.
Re: Should we care much about this Unicode-ish criticism?
NeonEdge [EMAIL PROTECTED] writes: This is evident in the Musical Symbols and even Byzantine Musical Symbols. Are these character sets more important than the actual language character sets being denied to the other countries? Are musical and mathematical symbols even a language at all? At the same time as 246 Byzantine Musical Symbols and 219 Musical Symbols were added, 43,253 Asian language ideographs were added. I fail to see the problem. Musical and mathematical symbols are certainly used more frequently than ancient Han ideographs that have been obsolete for 2,000 years, and it's not like the ideographs are having major difficulties being added to Unicode either. If the author of the original paper referred to here thinks there are still significant characters missing from Unicode, he should stop whining about it and put together a researched proposal. That's what the Byzantine music researchers did, and as a result their characters have now been added. This is how standardization works. You have to actually go do the work; you can't just complain and expect someone else to do it for you. In the meantime, the normally-encountered working character set of modern Asian languages has been in Unicode from the beginning, and currently the older and rarer characters and the characters used these days only in proper names are being backfilled at a rate of tens of thousands per Unicode revision. How this can then be described as ignoring Asian languages boggles me beyond words. There are a lot of characters. It takes time. Rome wasn't built in a day. It seems to me that Unicode, in it's present form, although a valiant attempt, is just a 'better' ascii, and not a complete solution. It seems to me that you haven't bothered to go look at what Unicode is actually doing. -- Russ Allbery ([EMAIL PROTECTED]) http://www.eyrie.org/~eagle/
Re: Should we care much about this Unicode-ish criticism?
Dan Sugalski writes:
: Have they changed that again? Last I checked, UTF-8 was capped at 4 bytes,
: but that's in the Unicode 3.0 standard.

Doesn't really matter where they install the artificial cap, because for philosophical reasons Perl is gonna support larger values anyway. It's just that 4 bytes of UTF-8 happens to be large enough to represent anything UTF-16 can represent with surrogates. So they refuse to believe in anything longer than 4 bytes, even though the representation can be extended much further. (Perl 5 extends it all the way to 64-bit values, represented in 13 bytes!)

They also arbitrarily define UTF-32 to not use higher values than 0x10FFFF, but that doesn't mean we're gonna send in the high-bit Nazis if people want higher values for their own purposes. But since the names UTF-8 and UTF-32 are becoming associated with those arbitrary restrictions, it's getting even more important to refer to Perl's looser style as utf8 (and, potentially, utf32). I don't know if Perl will have a utf16 that is distinguished from UTF-16. Larry
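The arithmetic behind those figures, sketched: an n-byte UTF-8 sequence (n = 2..6) carries 5n+1 payload bits, so the standard scheme tops out at 31 bits. The 13-byte figure for 64-bit values is from Larry's note above; the exact byte layout of perl 5's extension is assumed here, not shown:

    /* Bytes needed to hold a value of the given bit width. */
    int utf8ish_len(int bits) {
        int n;
        if (bits <= 7) return 1;               /* 0xxxxxxx */
        for (n = 2; n <= 6; n++)
            if (bits <= 5 * n + 1) return n;   /* 11->2, 16->3, 21->4, ... 31->6 */
        return 13;                             /* perl 5's loose utf8, per Larry */
    }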
Re: Should we care much about this Unicode-ish criticism?
Larry Wall [EMAIL PROTECTED] writes: Doesn't really matter where they install the artificial cap, because for philosophical reasons Perl is gonna support larger values anyway. It's just that 4 bytes of UTF-8 happens to be large enough to represent anything UTF-16 can represent with surrogates. So they refuse to believe in anything longer than 4 bytes, even though the representation can be extended much further. (Perl 5 extends it all the way to 64-bit values, represented in 13 bytes!) That's probably unnecessary; I really don't expect them to ever use all 31 bytes that the IETF-standardized version of UTF-8 supports. I don't know if Perl will have a utf16 that is distinguised from UTF-16. I wouldn't bother spending any time on UTF-16 beyond basic support for converting away from it. It combines the worst of both worlds, and I don't expect it to be used much now that they've buried the idea of keeping Unicode to 16 bits. -- Russ Allbery ([EMAIL PROTECTED]) http://www.eyrie.org/~eagle/
Re: Should we care much about this Unicode-ish criticism?
Russ Allbery [EMAIL PROTECTED] writes: That's probably unnecessary; I really don't expect them to ever use all 31 bytes that the IETF-standardized version of UTF-8 supports. 31 bits, rather. *sigh* But given that, modulo some debate over CJKV, we're getting into *really* obscure stuff already at only 94,140 characters, I'm guessing that there would have to be some really major and fundamental changes in written human communication before something more than two billion characters are used. Which doesn't mean rule out the possibility of ever expanding, since one should always leave that option open, but expending coding effort on it isn't worth it. Particularly since extending UTF-8 to more than 31 bits requires breaking some of the guarantees that UTF-8 makes, unless I'm missing how you're encoding the first byte so as not to give it a value of 0xFE. -- Russ Allbery ([EMAIL PROTECTED]) http://www.eyrie.org/~eagle/
Re: Should we care much about this Unicode-ish criticism?
On Tue, Jun 05, 2001 at 04:44:46PM -0700, Russ Allbery wrote: In the meantime, the normally-encountered working character set of modern Asian languages has been in Unicode from the beginning, and currently the older and rarer characters and the characters used these days only in proper names are being backfilled at a rate of tens of thousands per Unicode revision. How this can then be described as ignoring Asian languages boggles me beyond words. There are a lot of characters. It takes time. Rome wasn't built in a day. Also, remember what I wrote earlier: all the characters in the Chinese Han Yu Da Zidian and the Japanese Morohashi Dai Kanwa Jiten are now adopted into Unicode. -- If you do not wish your beer to be served without the traditional head, please ask for a top-up. With the subtext: Your traditional head will then exit via the traditional window. Arsehole. - Mark Dickerson
Re: Should we care much about this Unicode-ish criticism?
On Tue, Jun 05, 2001 at 04:44:46PM -0700, Russ Allbery wrote: NeonEdge [EMAIL PROTECTED] writes: This is evident in the Musical Symbols and even Byzantine Musical Symbols. Are these character sets more important than the actual language character sets being denied to the other countries? Are musical and mathematical symbols even a language at all? At the same time as 246 Byzantine Musical Symbols and 219 Musical Symbols were added, 43,253 Asian language ideographs were added. I fail to see the problem. Musical and mathematical symbols are certainly used more frequently than ancient Han ideographs that have been obsolete for 2,000 years, and it's not like the ideographs are having major difficulties being added to Unicode either. If the author of the original paper referred to here thinks there are still significant characters missing from Unicode, he should stop whining about it and put together a researched proposal. That's what the Byzantine music researchers did, and as a result their characters have now been added. This is how standardization works. You have to actually go do the work; you can't just complain and expect someone else to do it for you.

(as a lurker in the unicode list ([EMAIL PROTECTED]), which also had the link to the opinion under discussion posted in there) Exactly. As another data point, once in a while in the list someone asks what about Egyptian hieroglyphics, Unicode can't be all-encompassing, nyahnyahnyah? Well, there the situation is that there *is* slowly ongoing work between the egyptologists and the Unicode people to get all the stork-atop-a-hippo-facing-left encoded; it's just that the egyptologists themselves have a hard time agreeing what actually would be the canonical set of glyphs. There is a process for getting more characters into Unicode, but the Unicode people cannot be experts in all possible scripts. No proposals, no encodings.

Another constant source of confusion (which is at least part of the Asian discontent) is that Unicode encodes abstract characters, not any particular rendering (fonts). (There are some exceptions to this, but they are mainly there to guarantee a safe round-trip to Unicode and back for legacy characters.) For example, bold-a is the same as italic-a is the same as plain-a. The same principle was behind the Han unification. Sometimes it would be preferable to decompose characters to be more flexible and future-proof. For example, the number of codepoints for Han could be dramatically reduced if there were an agreed-upon way to electronically decompose the glyphs to radicals -- but it seems (I am not an expert on this, mind) that there isn't, and we have to deal with dozens of thousands of them. -- $jhi++; # http://www.iki.fi/jhi/ # There is this special biologist word we use for 'stable'. # It is 'dead'. -- Jack Cohen
Re: Should we care much about this Unicode-ish criticism?
At 04:44 PM 6/5/2001 -0700, Larry Wall wrote: Dan Sugalski writes: : Have they changed that again? Last I checked, UTF-8 was capped at 4 bytes, : but that's in the Unicode 3.0 standard. Doesn't really matter where they install the artificial cap, because for philosophical reasons Perl is gonna support larger values anyway. It's just that 4 bytes of UTF-8 happens to be large enough to represent anything UTF-16 can represent with surrogates. So they refuse to believe in anything longer than 4 bytes, even though the representation can be extended much further. (Perl 5 extends it all the way to 64-bit values, represented in 13 bytes!)

I know we can, but is it really a good idea? 32 bits is really stretching it for character encoding, and 64 seems rather excessive. Really space-wasteful as well, if we maintain a character type with a fixed width large enough to hold the largest decoded variable-width character. And I really, *really* want to do as little as possible internally with variable-width encodings. Yech.

They also arbitrarily define UTF-32 to not use higher values than 0x10FFFF, but that doesn't mean we're gonna send in the high-bit Nazis if people want higher values for their own purposes.

Well, that'd be inappropriate since a good chunk of the rest of the set's been dedicated to future expansion. I think it might be a reasonable idea for -w to grumble if someone's used a character in the unassigned range, though. (IIRC there's a piece set aside for folks to do whatever they want with)

But since the names UTF-8 and UTF-32 are becoming associated with those arbitrary restrictions, it's getting even more important to refer to Perl's looser style as utf8 (and, potentially, utf32). I don't know if Perl will have a utf16 that is distinguished from UTF-16.

I'd as soon not do UTF-16 at all, or at least no more than we need to convert to UTF-32 or UTF-8. Dan --it's like this--- Dan Sugalski even samurai [EMAIL PROTECTED] have teddy bears and even teddy bears get drunk
Re: Stacks, registers, and bytecode. (Oh, my!)
At 07:40 AM 6/5/2001 -0700, Dave Storrs wrote: On Tue, 5 Jun 2001, Dave Mitchell wrote: dispatch loop. I'd much rather have a 'regex start' opcode which calls a separate dispatch loop function, and which then interprets any further ops in the bytestream as regex ops. That way we double the number of 8-bit ops, and can have all the regex-specific state variables (s, send etc in the earlier example) and logic separated out. This is an interesting idea... could we use this more generally to multiply our number of opcodes? Basically, you have one set of opcodes for (e.g.) string parsing, one set for math, etc, all of which have the same value. Then you have a set of opcodes that tells the interpreter which opcode table to look in.

Nah, that's too much work. We just allow folks to define their own opcode functions and assign each a lexically unique number, and dispatch to the function as appropriate. Adding and overriding opcodes is definitely in the cards, though in most cases it'll probably be an opcode version of a function call, since machine-level stuff would also require telling the compiler how to emit those opcodes. (Which folks writing python/ruby/rebol/cobol/fortran front ends for the interpreter might do) Dan --it's like this--- Dan Sugalski even samurai [EMAIL PROTECTED] have teddy bears and even teddy bears get drunk
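A sketch of that registration scheme, with hypothetical names, reserved range, and table size:

    typedef void (*opfunc)(void *interp);

    #define MAX_OPS 65536
    static opfunc op_table[MAX_OPS];
    static int    next_op = 1024;      /* say 0..1023 are reserved for core */

    /* A front end registers its op function and gets back the unique
     * number its compiler should emit for that op. */
    int register_op(opfunc f) {
        if (next_op >= MAX_OPS)
            return -1;                 /* table full */
        op_table[next_op] = f;
        return next_op++;
    }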
Re: Should we care much about this Unicode-ish criticism?
Russ Allbery writes: : Particularly since extending UTF-8 to more : than 31 bits requires breaking some of the guarantees that UTF-8 makes, : unless I'm missing how you're encoding the first byte so as not to give it : a value of 0xFE. The UTF-16 BOMs, 0xFEFF and 0xFFFE, both turn out to be illegal UTF-8 in any case, so it doesn't much matter, assuming BOMs are used on UTF-16 that has to be auto-distinguished from UTF-8. (Doing any kind of auto-recognition on 16-bit data without BOMs is problematic in any case.) Larry
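A sketch of the auto-recognition Larry describes: the BOM's two serialized byte orders are both illegal as UTF-8, so a two-byte peek is enough (function and enum names invented here):

    enum guess { GUESS_UTF16_BE, GUESS_UTF16_LE, GUESS_UTF8ISH };

    enum guess sniff_bom(const unsigned char *p, int len) {
        if (len >= 2 && p[0] == 0xFE && p[1] == 0xFF) return GUESS_UTF16_BE;
        if (len >= 2 && p[0] == 0xFF && p[1] == 0xFE) return GUESS_UTF16_LE;
        return GUESS_UTF8ISH;   /* no BOM: treat as UTF-8; bare 16-bit
                                   data can't be reliably recognized */
    }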
Re: Should we care much about this Unicode-ish criticism?
Dan Sugalski writes: : At 04:44 PM 6/5/2001 -0700, Larry Wall wrote: : (Perl 5 extends it all the way to 64-bit values, represented in 13 bytes!) : : I know we can, but is it really a good idea? 32 bits is really stretching : it for character encoding, and 64 seems rather excessive. Such large values would not typically be used for standard characters, but as a means of embedding an inline chunk of non-character data, such as a pointer, or a set of metadata bits. : Really : space-wasteful as well, if we maintain a character type with a fixed width : large enough to hold the largest decoded variable-width character. True 'nuff. I suspect most people would want to stick within 32 bits, which is sufficiently wasteful for most purposes. : And I : really, *really* want to do as little as possible internally with : variable-width encodings. Yech. Mmm, the difficulty of that is overrated. Very seldom do you want to do anything other than find the next character, or the previous character, and those are pretty easy to do in utf8. : They also arbitrarily define UTF-32 to not use higher values than : 0x10, but that doesn't mean we're gonna send in the high-bit Nazis : if people want higher values for their own purposes. : : Well, that'd be inappropriate since a good chunk of the rest of the set's : been dedicated to future expansion. I think it might be a reasonable idea : for -w to grumble if someone's used a character in the unassigned range, : though. (IIRC there's a piece set aside for folks to do whatever they want : with) Certainly, but it's easy to come up with reasons to want to stuff more bits inline than the private use areas will support. Rather than have -w grumble about such characters, I'd rather see an optional output discipline that enforces strict Unicode output. : But since the names UTF-8 and UTF-32 are becoming associated with those : arbitrary restrictions, it's getting even more important to refer to : Perl's looser style as utf8 (and, potentially, utf32). I don't know : if Perl will have a utf16 that is distinguised from UTF-16. : : I'd as soon not do UTF-16 at all, or at least no more than we need to : convert to UTF-32 or UTF-8. Well, as you pointed out above, we might not use any kind of UTF internally, but just arrays of properly sized integers, which are never variable length. (UTF-32 is the only UTF that's not a variable-length encoding.) On the other hand, maybe there's some use for a data structure that is a sequence of integers of various sizes, where the representation of different chunks of the array/string might be different sizes. Would make some aspects of copy-on-write more efficient to be able to chunk strings and integer arrays. And of course this would all be transparent at the language level, in the absence of explicit syntax to treat an array as a string or a string as an array. Larry
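Larry's "next character, previous character" point, sketched: UTF-8 continuation bytes all match 10xxxxxx, so stepping either way is a short scan (this assumes a well-formed buffer with room to move in the given direction):

    const unsigned char *utf8_next(const unsigned char *p) {
        p++;
        while ((*p & 0xC0) == 0x80) p++;   /* skip continuation bytes */
        return p;
    }

    const unsigned char *utf8_prev(const unsigned char *p) {
        p--;
        while ((*p & 0xC0) == 0x80) p--;   /* back up over continuation bytes */
        return p;
    }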
Re: Should we care much about this Unicode-ish criticism?
Larry Wall [EMAIL PROTECTED] writes: Russ Allbery writes: Particularly since extending UTF-8 to more than 31 bits requires breaking some of the guarantees that UTF-8 makes, unless I'm missing how you're encoding the first byte so as not to give it a value of 0xFE. The UTF-16 BOMs, 0xFEFF and 0xFFFE, both turn out to be illegal UTF-8 in any case, so it doesn't much matter, assuming BOMs are used on UTF-16 that has to be auto-distinguished from UTF-8. (Doing any kind of auto-recognition on 16-bit data without BOMs is problematic in any case.)

Yeah, but one of the guarantees of UTF-8 is:

    - The octet values FE and FF never appear.

I can see that this property may not be that important, but it makes me feel like things that don't have this property aren't really UTF-8. -- Russ Allbery ([EMAIL PROTECTED]) http://www.eyrie.org/~eagle/