Unicode 3.1, UTF-16, and Java [Re: Perl 6, The Good Parts Version]
On Wed, Jul 17, 2002 at 12:13:47PM -0400, Dan Sugalski wrote:
> I thought Java used UTF-16. It's a variable-width encoding, so it should be fine. (Though I bet a lot of folks will be rather surprised when it happens...)

Update: Since Unicode 3.1 (3.2 is the current version), there have in fact been defined characters outside the 16-bit range U+0000 to U+FFFF. For instance, the block U+1D100 to U+1D1FF contains musical symbols.

Since Java 'char's are 16-bit quantities, characters outside of the range U+0000 to U+FFFF have to be represented by pairs of characters from the 'surrogates' range, U+D800 through U+DFFF. Java does not handle this conversion transparently; for instance, the \u sequence to include a Unicode character code point takes exactly four hexadecimal digits. So to represent, e.g., U+1D107 MUSICAL SYMBOL RIGHT REPEAT SIGN, you have to manually compose the surrogate pair (U+D834, U+DD07).

This is a good thing from the point of view of the Java programmer since it means a 'char' is always the same size, even though it may not represent the entire desired character. In that, however, it is not fundamentally different from composition within the 16-bit range - e.g. composing 'a' (U+0061) and the combining version of '~' (U+0303) to get 'ã', instead of using the single character U+00E3.

Note that surrogates are bypassed when encoding in UTF-8; you just transform the desired code point directly, resulting in a UTF-8 sequence of four octets (characters through U+FFFF require a maximum of three octets in UTF-8). Perl 5.6.1 already handles this correctly for \x{...} values greater than 0xFFFF; e.g. perl -e 'print "\x{1d107}\n";' will output the four-byte UTF-8 encoding for that character.

--
Mark REED                    | CNN Internet Technology
1 CNN Center Rm SW0831G      | [EMAIL PROTECTED]
Atlanta, GA 30348 USA        | +1 404 827 4754
--
Going the speed of light is bad for your age.
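The surrogate arithmetic in the message above can be checked with a short sketch. It is written in Python rather than Java or Perl purely for illustration; the function name to_surrogate_pair is hypothetical and not part of any library mentioned in the thread. It splits U+1D107 into its two UTF-16 surrogates by hand, compares the result with Python's own UTF-16 encoder, and shows the four-octet UTF-8 form that bypasses surrogates.

```python
def to_surrogate_pair(code_point):
    """Split a code point in U+10000..U+10FFFF into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000        # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits go in the high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits go in the low surrogate
    return high, low


high, low = to_surrogate_pair(0x1D107)
print(f"U+{high:04X} U+{low:04X}")       # U+D834 U+DD07

# Python's built-in encoder produces the same two 16-bit units.
utf16 = chr(0x1D107).encode("utf-16-be")
print(utf16.hex())                       # d834dd07

# UTF-8 encodes the code point directly: four octets, no surrogates.
utf8 = chr(0x1D107).encode("utf-8")
print(utf8.hex())                        # f09d8487

# The last code point of the 16-bit range still fits in three octets.
print(len(chr(0xFFFF).encode("utf-8")))  # 3
```

In Java source the same character would have to be written as the two escapes \uD834\uDD07, which is the manual composition the message describes.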
Re: Perl 6, The Good Parts Version
>>>>> "pdcawley" == pdcawley [EMAIL PROTECTED] writes:

pdcawley> Would I be right in thinking that it should be possible to implement a
pdcawley> Prolog-like language almost entirely within a regular expression?

pdcawley> Anyone want to step up to the plate? I've already done a Scheme proof
pdcawley> of concept after all...

This is already a thread on perlmonks.org... see user ovid.

--
Randal L. Schwartz - Stonehenge Consulting Services, Inc. - +1 503 777 0095
[EMAIL PROTECTED] URL:http://www.stonehenge.com/merlyn/
Perl/Unix/security consulting, Technical writing, Comedy, etc. etc.
See PerlTraining.Stonehenge.com for onsite and open-enrollment Perl training!
Re: Perl 6, The Good Parts Version
On Wed, Jul 17, 2002 at 12:32:43AM -0400, Mark J. Reed wrote:
> On Tue, Jul 16, 2002 at 05:42:18PM -0400, Michael G Schwern wrote:
> > I don't know how Java and Python handle Unicode.
>
> Java has always been 100% Unicode from the ground up; it's in the spec. The fundamental char type is a 16-bit value, you can use any letterlike

My understanding was that Unicode has now escaped the base plane (or whatever it's called) and now has started using code points at 65536 and above. How does Java cope with this?

Nicholas Clark
Re: Perl 6, The Good Parts Version
At 4:17 PM +0100 7/17/02, Nicholas Clark wrote:
> On Wed, Jul 17, 2002 at 12:32:43AM -0400, Mark J. Reed wrote:
> > On Tue, Jul 16, 2002 at 05:42:18PM -0400, Michael G Schwern wrote:
> > > I don't know how Java and Python handle Unicode.
> >
> > Java has always been 100% Unicode from the ground up; it's in the spec. The fundamental char type is a 16-bit value, you can use any letterlike
>
> My understanding was that Unicode has now escaped the base plane (or whatever it's called) and now has started using code points at 65536 and above. How does Java cope with this?

I thought Java used UTF-16. It's a variable-width encoding, so it should be fine. (Though I bet a lot of folks will be rather surprised when it happens...)

--
Dan

--it's like this---
Dan Sugalski                  even samurai
[EMAIL PROTECTED]             have teddy bears and even
                              teddy bears get drunk
Re: Perl 6, The Good Parts Version
On Wed, Jul 17, 2002 at 04:17:15PM +0100, Nicholas Clark wrote:
> My understanding was that Unicode has now escaped the base plane (or whatever it's called) and now has started using code points at 65536 and above. How does Java cope with this?

This is getting a little off-topic, I think. But here's a brief overview of the Unicode codespace size issue - if you have any more questions, you can ask me off-list.

There were originally two separate universal character set efforts, by the ISO and the Unicode Consortium. They decided early on to combine their efforts and be mutually compatible. However, ISO-10646 was designed as a 32-bit code, consisting of 65,536 16-bit planes, while Unicode was only 16 bits. So Unicode is identical to plane 0 of ISO-10646, called the Basic Multilingual Plane (BMP).

So far, the ISO has no characters defined outside of this plane. It does plan to define some eventually, however (in ISO-10646-2), and this is handled in Unicode through a section of the code space called surrogates, which are used in the UTF-16 encoding to reach planes 1-16 of ISO-10646. ISO has no plans to define characters outside of planes 1-16 anytime in the foreseeable future (or, indeed, outside of planes 1-14, since 15 and 16 are reserved for private use).

--
Mark REED                    | CNN Internet Technology
1 CNN Center Rm SW0831G      | [EMAIL PROTECTED]
Atlanta, GA 30348 USA        | +1 404 827 4754
--
The end of the world will occur at three p.m., this Friday, with symposium to follow.
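To make the plane arithmetic concrete, here is a small Python sketch (Python chosen only for illustration; plane_of is a hypothetical helper name). It assumes the usual layout of 65,536 code points per plane, so the plane number is simply the code point shifted right by 16 bits, and it works out the highest code point a surrogate pair can reach, which is why UTF-16 stops at plane 16.

```python
def plane_of(code_point):
    """Return the plane number, assuming 0x10000 code points per plane."""
    return code_point >> 16


# Largest value a surrogate pair can express: both halves at their maximum
# (high surrogate U+DBFF, low surrogate U+DFFF).
max_via_surrogates = 0x10000 + ((0xDBFF - 0xD800) << 10) + (0xDFFF - 0xDC00)

print(plane_of(0x00E3))              # 0, the Basic Multilingual Plane
print(plane_of(0x1D107))             # 1
print(hex(max_via_surrogates))       # 0x10ffff
print(plane_of(max_via_surrogates))  # 16
```

The 1,024 high surrogates times 1,024 low surrogates give exactly 16 planes of 65,536 code points beyond the BMP.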
Re: Perl 6, The Good Parts Version
On Wed, Jul 17, 2002 at 12:13:47PM -0400, Dan Sugalski wrote:
> I thought Java used UTF-16. It's a variable-width encoding, so it should be fine. (Though I bet a lot of folks will be rather surprised when it happens...)

UTF-16 isn't technically a variable-width encoding, since surrogate codes are still considered single characters - even though they only have meaning when combined in pairs. It's much the same as multiple combining characters coming together to represent a single abstract entity that is also not really a character; the chief difference is that surrogates don't mean anything at all on their own.

--
Mark REED                    | CNN Internet Technology
1 CNN Center Rm SW0831G      | [EMAIL PROTECTED]
Atlanta, GA 30348 USA        | +1 404 827 4754
--
There are no rules for March. March is spring, sort of, usually. March means maybe, but don't bet on it.
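The comparison between surrogate pairs and combining sequences can be seen by counting 16-bit units, as in the rough Python sketch below (again only an illustration; utf16_units is a hypothetical helper). Both a character outside the BMP and a base letter plus combining mark occupy two 16-bit units, but only the combining sequence has a single-unit precomposed equivalent, and only its parts mean something on their own.

```python
import unicodedata


def utf16_units(text):
    """Count the 16-bit code units in the UTF-16 encoding of text."""
    return len(text.encode("utf-16-be")) // 2


repeat_sign = "\U0001D107"   # one code point outside the BMP
composed = "\u00E3"          # precomposed LATIN SMALL LETTER A WITH TILDE
decomposed = "a\u0303"       # 'a' followed by COMBINING TILDE

print(utf16_units(repeat_sign))   # 2 (a surrogate pair)
print(utf16_units(composed))      # 1
print(utf16_units(decomposed))    # 2 (two separate characters)

# Canonical composition folds the combining sequence into one code point.
print(unicodedata.normalize("NFC", decomposed) == composed)   # True
```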
Re: Perl 6, The Good Parts Version
At 12:34 PM -0400 7/17/02, Mark J. Reed wrote:
> On Wed, Jul 17, 2002 at 12:13:47PM -0400, Dan Sugalski wrote:
> > I thought Java used UTF-16. It's a variable-width encoding, so it should be fine. (Though I bet a lot of folks will be rather surprised when it happens...)
>
> UTF-16 isn't technically a variable-width encoding, since surrogate codes are still considered single characters - even though they only have meaning when combined in pairs. It's much the same as multiple combining characters coming together to represent a single abstract entity that is also not really a character; the chief difference is that surrogates don't mean anything at all on their own.

Yeah, I see that's how the standard defines it, but... Looks like a serious dodge to me. :)

--
Dan

--it's like this---
Dan Sugalski                  even samurai
[EMAIL PROTECTED]             have teddy bears and even
                              teddy bears get drunk
Re: Perl 6, The Good Parts Version
On Wed, Jul 03, 2002 at 10:52:58PM +0100, Tim Bunce wrote:
> Don't forget Apocalypse 5. Personally I believe the elegant and thorough integration of regular expressions and backtracking into the large-scale logic of an application is one of the most radical things about Perl 6.

How does one explain this to an audience that likely isn't convinced regexes are all that important in the first place? "Sure it's line noise, but it's new and improved line noise!"

I may have to avoid the topic of regex improvements unless I can cover it in 5 minutes. Maybe a quick poll of how many people are using one of the many Perl5-like regex libraries; if there's a high proportion, then talk about the new regex stuff.

Grammars, OTOH, are something I think I'll mention. I also forgot hyperoperators.

Also it's likely worth mentioning that Perl's method call syntax will switch to the dot, making it look more like other languages.

Unicode from the ground up is probably also worth mentioning, though I'm not quite sure what forms this will take other than that Unicode will not be an awkward, bolt-on feature. I don't know how Java and Python handle Unicode.

--
This sig file temporarily out of order.
Re: Perl 6, The Good Parts Version
On Tue, Jul 16, 2002 at 05:42:18PM -0400, Michael G Schwern wrote:
> I don't know how Java and Python handle Unicode.

Java has always been 100% Unicode from the ground up; it's in the spec. The fundamental char type is a 16-bit value, you can use any letterlike characters in identifiers, there's an escape sequence to include untypable characters in strings, etc. I/O defaults to UTF-8 but you can arrange for other encodings.

I don't know about Python.

--
Mark REED                    | CNN Internet Technology
1 CNN Center Rm SW0831G      | [EMAIL PROTECTED]
Atlanta, GA 30348 USA        | +1 404 827 4754
--
You're never too old to become younger. -- Mae West
Re: Perl 6, The Good Parts Version
> What about parsing? I think the fact that Perl 6 will pretty much have parser capabilities built in is pretty distinctive.
>
> Ted

When someone wants to write a parser, they turn to Perl 90% of the time (at least to prototype). The fact that they're really using a powerful lexer instead of a parser and don't realize it is a sign of why regex and regex culture needed to change. But I don't think that advances in regexen, however great, are what Pythonistas and Java Junkies need to hear to be convinced of Perl 6's power/usefulness.

-Erik
Perl 6, The Good Parts Version
I've just submitted a short talk to the Scandinavian Conference on Java And Object Orientation (JAOO.org) [1] entitled "Perl 6, The Good Parts". This talk will be given to an audience of mostly Java, Python and Ruby programmers with a smattering of XP Agile methodology folks and OO and Pattern gurus. It will try to convince them of two things:

    Perl 6 is not a joke (anymore).

    The Perl 6 language, design and implementation contain revolutionary ideas that you should pay attention to.

I've been trying to pick out what parts of Perl 6 would make a Java programmer sit up and go "I wish I had that" or a Python programmer think "Hmm, maybe there is more than one way to do it" and, in fine Perl tradition, a few things which make the whole audience go "what a bunch of fruitcakes!"

Here's what I've got:

Parrot
    Both as our answer to the JVM and .Net and that the language design and the coding of the internals are going on simultaneously.

Topicalizers
    Perl 5 has Do What I Mean, Perl 6 will have Ya Know What I Mean. A language which understands the concept of "it".

Community Funding
    A programming community with employees. $200,000 raised so far.

Community Design
    The sometimes rocky process of design by community.

Closures, Continuations, Currying, Everything Is An Object, Multimethod Dispatch, Slots, Introspection...
    Sure, other languages have these features, but all together in one language?

Attributes
    Transcending mere objects and classes, Perl 6 introduces adverbs.

Parrot, Funding and Design are pretty straightforward to explain to an audience of Java programmers. For the rest, I'm asking for help placing the proper spin on it. Topicalizers will be particularly tricky to explain without making it just sound like an opportunity to write more incomprehensible Perl code.

I'm also trying to think of more bits to throw in. Particularly in terms of the OO system, this being a conference about OO. From what I've heard so far, Perl 6's OO system will be largely playing catch-up with other languages. Hopefully the Cabal [2] can debunk that. What will Perl 6's class system offer that will impress a Java programmer?

[1] I was invited to speak there last year by mistake and liked it so much I'm trying to weasel my way in again.

[2] Of which there is none.

--
This sig file temporarily out of order.
Re: Perl 6, The Good Parts Version
In a message dated Wed, 3 Jul 2002, Michael G Schwern writes:
> Attributes
>     Transcending mere objects and classes, Perl 6 introduces adverbs.

*confused* Attributes are adjectives, not adverbs. Aren't they?

Trey
Re: Perl 6, The Good Parts Version
On Wed, Jul 03, 2002 at 01:23:24PM -0400, Michael G Schwern wrote:
> Hopefully the Cabal [2] can debunk that.
[snip]
> [2] Of which there is none.

and http://www.perlcabal.com/ doesn't exist, right? ;-)

--
"I do not resent criticism, even when, for the sake of emphasis, it parts for the time with reality". Winston Churchill, House of Commons, 22nd Jan 1941.
Re: Perl 6, The Good Parts Version
On Wed, Jul 03, 2002 at 01:23:24PM -0400, Michael G Schwern wrote:
> I'm also trying to think of more bits to throw in. Particularly in terms of the OO system, this being a conference about OO. From what I've heard so far, Perl 6's OO system will be largely playing catch up with other languages.

Don't forget Apocalypse 5. Personally I believe the elegant and thorough integration of regular expressions and backtracking into the large-scale logic of an application is one of the most radical things about Perl 6.

Tim.
Re: Perl 6, The Good Parts Version
Trey Harris wrote at Wed, 03 Jul 2002 19:44:45 +0200:
> In a message dated Wed, 3 Jul 2002, Michael G Schwern writes:
> > Attributes
> >     Transcending mere objects and classes, Perl 6 introduces adverbs.
>
> *confused* Attributes are adjectives, not adverbs. Aren't they?

Attributes describe the behaviour of subroutines, I think. As a subroutine is a "doing word" - a verb - I would say an attribute can be an adverb.

Of course an attribute of a variable is an adjective from this point of view :-)

I think it's a possible point of view, as there are many natural languages (e.g. German) that don't care a lot about the difference between adverbs and adjectives.

Cheerio,
Janek
Re: Perl 6, The Good Parts Version
At 9:20 PM +0100 7/3/02, Dave Mitchell wrote:
> On Wed, Jul 03, 2002 at 01:23:24PM -0400, Michael G Schwern wrote:
> > Hopefully the Cabal [2] can debunk that.
> [snip]
> > [2] Of which there is none.
>
> and http://www.perlcabal.com/ doesn't exist, right? ;-)

Of course not. Otherwise it wouldn't 404, now would it? ;-P

--
Dan

--it's like this---
Dan Sugalski                  even samurai
[EMAIL PROTECTED]             have teddy bears and even
                              teddy bears get drunk
Re: Perl 6, The Good Parts Version
On Wed, Jul 03, 2002 at 09:20:01PM +0100, Dave Mitchell wrote:
> On Wed, Jul 03, 2002 at 01:23:24PM -0400, Michael G Schwern wrote:
> > Hopefully the Cabal [2] can debunk that.
> [snip]
> > [2] Of which there is none.
>
> and http://www.perlcabal.com/ doesn't exist, right? ;-)

    Not Found

    The requested URL / was not found on this server.

    TINPC/1.3.14 Server at www.perlcabal.com Port 80

*snort* :)

--
This sig file temporarily out of order.
Re: Perl 6, The Good Parts Version
On Wed, 3 Jul 2002, Janek Schleicher wrote:
: Trey Harris wrote at Wed, 03 Jul 2002 19:44:45 +0200:
:
: > In a message dated Wed, 3 Jul 2002, Michael G Schwern writes:
: > > Attributes
: > >     Transcending mere objects and classes, Perl 6 introduces adverbs.
: >
: > *confused* Attributes are adjectives, not adverbs. Aren't they?
:
: Attributes describe the behaviour of sub routines, I think.
: As a sub routine is a Doing word - a verb,
: I would say an attribute can be an adverb.

When a sub is declared, it's just an object, and so any properties you apply to it at that point really are functioning as adjectives.

Perl 6 will support adverbs, but that's just a way to pass additional arguments to something like the range operator. It really does modify the operation, not the operator. And it's syntactically distinguished from adjectives.

Admittedly the concepts mush together in many natural languages.

But please don't continue to call the adjectives attributes. They're properties now. We're reserving the term attribute for object instance variables. That is, attributes are formally defined per-class, whereas properties are defined per-object on an ad hoc basis. It will reduce confusion if we can keep those terms straight. It was a mistake to call what Perl 5 has attributes, because that's a standard industry term for instance variables.

Larry
Re: Perl 6, The Good Parts Version
On Wed, Jul 03, 2002 at 05:13:01PM -0400, Michael G Schwern wrote:
> On Wed, Jul 03, 2002 at 09:20:01PM +0100, Dave Mitchell wrote:
> > On Wed, Jul 03, 2002 at 01:23:24PM -0400, Michael G Schwern wrote:
> > > Hopefully the Cabal [2] can debunk that.
> > [snip]
> > > [2] Of which there is none.
> >
> > and http://www.perlcabal.com/ doesn't exist, right? ;-)
>
>     Not Found
>
>     The requested URL / was not found on this server.
>
>     TINPC/1.3.14 Server at www.perlcabal.com Port 80
>
> *snort* :)

Odd how that text isn't what it seems...

Tim.