Unicode 3.1, UTF-16, and Java [Re: Perl 6, The Good Parts Version]

2002-07-31 Thread Mark J. Reed

 On Wed, Jul 17, 2002 at 12:13:47PM -0400, Dan Sugalski wrote:
  I thought Java used UTF-16. It's a variable-width encoding, so it 
  should be fine. (Though I bet a lot of folks will be rather surprised 
  when it happens...)
Update:

Since Unicode 3.1 (3.2 is the current version), there have in fact
been characters defined outside the 16-bit range U+0000 to U+FFFF.
For instance, the block U+1D100 to U+1D1FF contains musical symbols.

Since Java 'char's are 16-bit quantities, characters outside of
the range U+0000 to U+FFFF have to be represented by pairs of
characters from the 'surrogates' range, U+D800 through U+DFFF.
Java does not handle this conversion transparently; for instance,
the \u sequence to include a Unicode character code point
takes exactly four hexadecimal digits.  So to represent, e.g.,
U+1D107  MUSICAL SYMBOL RIGHT REPEAT SIGN, you have to manually
compose the surrogate pair (U+D834, U+DD07).  This is a good thing
from the point of view of the Java programmer since it means a
'char' is always the same size, even though it may not represent the
entire desired character.  In that, however, it is not fundamentally
different from composition within the 16-bit range - e.g. composing
'a' (U+0061) and the combining version of '~' (U+0303) to get 'ã',
instead of using the single character U+00E3.
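
As a rough illustration of the surrogate arithmetic described above, here
is a minimal sketch in Python (not Java).  The function names
to_surrogate_pair and java_escape are made up for this example; only the
arithmetic itself comes from the UTF-16 definition.

```python
def to_surrogate_pair(code_point):
    """Split a code point in U+10000..U+10FFFF into a UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the 16-bit range")
    offset = code_point - 0x10000       # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits select the high surrogate
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits select the low surrogate
    return high, low


def java_escape(code_point):
    """Format the pair as two four-digit escapes, as Java source would need."""
    high, low = to_surrogate_pair(code_point)
    return "\\u%04X\\u%04X" % (high, low)


# U+1D107 MUSICAL SYMBOL RIGHT REPEAT SIGN
print(java_escape(0x1D107))  # prints \uD834\uDD07
```

The result matches the pair (U+D834, U+DD07) given above.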

Note that surrogates are bypassed when encoding in UTF-8;
you just transform the desired code point directly, resulting in a
UTF-8 sequence of four octets (characters through U+FFFF require a
maximum of three octets in UTF-8).  Perl 5.6.1 already handles this
correctly for \x{...} values greater than 0xFFFF; e.g.
perl -e 'print "\x{1d107}\n";' will output the four-byte UTF-8 encoding
for that character.
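
To make the "transform the code point directly" step concrete, here is a
hand-rolled UTF-8 encoder, sketched in Python 3 for illustration only.
The name utf8_encode is hypothetical; real programs would simply use the
built-in str.encode, which the sketch checks itself against.

```python
def utf8_encode(code_point):
    """Encode a single code point as UTF-8 without going through surrogates."""
    if 0xD800 <= code_point <= 0xDFFF or not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not an encodable code point")
    if code_point < 0x80:                  # one octet
        return bytes([code_point])
    if code_point < 0x800:                 # two octets
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:               # three octets, up to U+FFFF
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    return bytes([0xF0 | (code_point >> 18),   # four octets beyond U+FFFF
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


encoded = utf8_encode(0x1D107)
print(encoded.hex())                            # prints f09d8487
print(encoded == chr(0x1D107).encode("utf-8"))  # prints True
print(len(utf8_encode(0xFFFF)))                 # prints 3
```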

-- 
Mark REED| CNN Internet Technology
1 CNN Center Rm SW0831G  | [EMAIL PROTECTED]
Atlanta, GA 30348  USA   | +1 404 827 4754 
--
Going the speed of light is bad for your age.



Re: Perl 6, The Good Parts Version

2002-07-18 Thread Randal L. Schwartz

 pdcawley == pdcawley  [EMAIL PROTECTED] writes:

pdcawley Would I be right in thinking that it should be possible to implement a
pdcawley prolog like language almost entirely within a regular expression?
pdcawley Anyone want to step up to the plate? I've already done a Scheme proof
pdcawley of concept after all...

This is already a thread on perlmonks.org... see user ovid.

-- 
Randal L. Schwartz - Stonehenge Consulting Services, Inc. - +1 503 777 0095
[EMAIL PROTECTED] URL:http://www.stonehenge.com/merlyn/
Perl/Unix/security consulting, Technical writing, Comedy, etc. etc.
See PerlTraining.Stonehenge.com for onsite and open-enrollment Perl training!



Re: Perl 6, The Good Parts Version

2002-07-17 Thread Nicholas Clark

On Wed, Jul 17, 2002 at 12:32:43AM -0400, Mark J. Reed wrote:
 On Tue, Jul 16, 2002 at 05:42:18PM -0400, Michael G Schwern wrote:
  I don't know how Java and Python handle Unicode.
 Java has always been 100% Unicode from the ground up; it's in the spec.
 The fundamental char type is a 16-bit value, you can use any letterlike

My understanding was that Unicode has now escaped the base plane (or whatever
it's called) and now has started using code points > 65536. How does Java
cope with this?

Nicholas Clark



Re: Perl 6, The Good Parts Version

2002-07-17 Thread Dan Sugalski

At 4:17 PM +0100 7/17/02, Nicholas Clark wrote:
On Wed, Jul 17, 2002 at 12:32:43AM -0400, Mark J. Reed wrote:
  On Tue, Jul 16, 2002 at 05:42:18PM -0400, Michael G Schwern wrote:
   I don't know how Java and Python handle Unicode.
  Java has always been 100% Unicode from the ground up; it's in the spec.
  The fundamental char type is a 16-bit value, you can use any letterlike

My understanding was that Unicode has now escaped the base plane (or whatever
it's called) and now has started using code points > 65536. How does Java
cope with this?

I thought Java used UTF-16. It's a variable-width encoding, so it 
should be fine. (Though I bet a lot of folks will be rather surprised 
when it happens...)
-- 
 Dan

--it's like this---
Dan Sugalski  even samurai
[EMAIL PROTECTED] have teddy bears and even
   teddy bears get drunk



Re: Perl 6, The Good Parts Version

2002-07-17 Thread Mark J. Reed

On Wed, Jul 17, 2002 at 04:17:15PM +0100, Nicholas Clark wrote:
 My understanding was that Unicode has now escaped the base plane (or whatever
 it's called) and now has started using code points > 65536. How does Java
 cope with this?
This is getting a little off-topic, I think.  But here's a brief overview
of the Unicode codespace size issue  - if you have any more questions,
you can ask me off-list.

There were originally two separate universal character set efforts,
by the ISO and the Unicode Consortium.  They decided early on to
combine their efforts and be mutually compatible. 

However, ISO-10646 was designed as a 32-bit code, consisting
of 65,536 16-bit planes, while Unicode was only 16 bits. 
So Unicode is identical to plane 0 of ISO-10646, called the
Basic Multilingual Plane (BMP).  So far, the ISO has no characters
defined outside of this plane.  

It does plan to define some eventually, however (in ISO-10646-2), and
this is handled in Unicode through a section of the code space called
surrogates, which are used in the UTF-16 encoding to reach planes
1-16 of ISO-10646.

ISO has no plans to define characters outside of planes 1-16 anytime
in the foreseeable future (or, indeed, outside of planes 1-14, since
15 and 16 are reserved for private use).
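
For illustration, the plane layout above can be sketched in a few lines of
Python.  The helper name plane_of is made up here; the sketch assumes the
usual reading that each plane holds 65,536 code points and that a
surrogate pair combines one of 1,024 high values with one of 1,024 low
values.

```python
PLANE_SIZE = 0x10000  # 65,536 code points per plane


def plane_of(code_point):
    """Return the plane number (0 is the BMP) that a code point falls in."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the range reachable from UTF-16")
    return code_point >> 16


print(plane_of(0x00E3))    # prints 0  (Basic Multilingual Plane)
print(plane_of(0x1D107))   # prints 1
print(plane_of(0x10FFFF))  # prints 16

# 1,024 high surrogates times 1,024 low surrogates cover exactly 16 planes.
print((1024 * 1024) // PLANE_SIZE)  # prints 16
```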

-- 
Mark REED| CNN Internet Technology
1 CNN Center Rm SW0831G  | [EMAIL PROTECTED]
Atlanta, GA 30348  USA   | +1 404 827 4754 
--
The end of the world will occur at three p.m., this Friday, with
symposium to follow.



Re: Perl 6, The Good Parts Version

2002-07-17 Thread Mark J. Reed

On Wed, Jul 17, 2002 at 12:13:47PM -0400, Dan Sugalski wrote:
 I thought Java used UTF-16. It's a variable-width encoding, so it 
 should be fine. (Though I bet a lot of folks will be rather surprised 
 when it happens...)
UTF-16 isn't technically a variable-width encoding, since
surrogate codes are still considered single characters - even
though they only have meaning when combined in pairs.  It's much
the same as multiple combining characters coming together to represent
a single abstract entity that is also not really a character; the
chief difference is that surrogates don't mean anything at all on their own.
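
The contrast drawn above can be seen in a small Python 3 sketch (an
illustration only; this is Python's behaviour, not Java's).  It uses the
standard unicodedata module, and the variable names are arbitrary.

```python
import unicodedata

# Two code points, 'a' plus COMBINING TILDE, compose to the single U+00E3.
decomposed = "a\u0303"
composed = unicodedata.normalize("NFC", decomposed)
print(len(decomposed), len(composed))   # prints 2 1
print(composed == "\u00e3")             # prints True

# A combining character still has an identity of its own.
print(unicodedata.name("\u0303"))       # prints COMBINING TILDE

# A surrogate pair is fine in UTF-16, but half of one cannot be encoded.
print(chr(0x1D107).encode("utf-16-be").hex())  # prints d834dd07
try:
    "\ud834".encode("utf-16-be")
    lone_surrogate_ok = True
except UnicodeEncodeError:
    lone_surrogate_ok = False
print(lone_surrogate_ok)                # prints False
```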

-- 
Mark REED| CNN Internet Technology
1 CNN Center Rm SW0831G  | [EMAIL PROTECTED]
Atlanta, GA 30348  USA   | +1 404 827 4754 
--
There are no rules for March.  March is spring, sort of, usually.  March
means maybe, but don't bet on it.



Re: Perl 6, The Good Parts Version

2002-07-17 Thread Dan Sugalski

At 12:34 PM -0400 7/17/02, Mark J. Reed wrote:
On Wed, Jul 17, 2002 at 12:13:47PM -0400, Dan Sugalski wrote:
  I thought Java used UTF-16. It's a variable-width encoding, so it
  should be fine. (Though I bet a lot of folks will be rather surprised
  when it happens...)
UTF-16 isn't technically a variable-width encoding, since
surrogate codes are still considered single characters - even
though they only have meaning when combined in pairs.  It's much
the same as multiple combining characters coming together to represent
a single abstract entity that is also not really a character; the
chief difference is that surrogates don't mean anything at all on their own.

Yeah, I see that's how the standard defines it, but... Looks like a 
serious dodge to me. :)
-- 
 Dan

--it's like this---
Dan Sugalski  even samurai
[EMAIL PROTECTED] have teddy bears and even
   teddy bears get drunk



Re: Perl 6, The Good Parts Version

2002-07-16 Thread Michael G Schwern

On Wed, Jul 03, 2002 at 10:52:58PM +0100, Tim Bunce wrote:
 Don't forget Apocalypse 5.

 Personally I believe the elegant and thorough integration of regular
 expressions and backtracking into the large-scale logic of an
 application is one of the most radical things about Perl 6.

How does one explain this to an audience that likely isn't convinced regexes
are all that important in the first place?  Sure it's line noise, but it's
new and improved line noise! I may have to avoid the topic of regex
improvements unless I can cover it in under 5 minutes.  Maybe a quick poll of
how many people are using one of the many Perl5-like regex libraries; if
a high proportion are, then talk about the new regex stuff.

Grammars, OTOH, is something I think I'll mention.

I also forgot hyperoperators.  Also it's likely worth mentioning that perl's
method call syntax will switch to the dot making it look more like other
languages.

Unicode from the ground up is probably also worth mentioning, though I'm not
quite sure what forms this will take other than that Unicode will not be an
awkward, bolt-on feature.  I don't know how Java and Python handle Unicode.


-- 
This sig file temporarily out of order.



Re: Perl 6, The Good Parts Version

2002-07-16 Thread Mark J. Reed

On Tue, Jul 16, 2002 at 05:42:18PM -0400, Michael G Schwern wrote:
 I don't know how Java and Python handle Unicode.
Java has always been 100% Unicode from the ground up; it's in the spec.
The fundamental char type is a 16-bit value, you can use any letterlike
characters in identifiers, there's an escape sequence
to include untypable characters in strings, etc.  I/O defaults to UTF-8
but you can arrange for other encodings.

I don't know about Python.

-- 
Mark REED| CNN Internet Technology
1 CNN Center Rm SW0831G  | [EMAIL PROTECTED]
Atlanta, GA 30348  USA   | +1 404 827 4754 
--
You're never too old to become younger.
-- Mae West



Re: Perl 6, The Good Parts Version

2002-07-05 Thread Erik Steven Harrison


What about parsing?  I think the fact that Perl 6 will pretty much
have parser capabilities built in is pretty distinctive.

Ted

When someone wants to write a parser, they turn to Perl 90% of the time (at least to 
prototype). The fact that they're really using a powerful lexer instead of a parser 
and don't realize it is a sign of why regex and regex culture needed to change. 

But I don't think that advances in regexen, however great, are what Pythonistas and 
Java Junkies need to hear to be convinced of Perl 6's power/usefulness. 

-Erik





Perl 6, The Good Parts Version

2002-07-03 Thread Michael G Schwern

I've just submitted a short talk to the Scandinavian Conference on Java And
Object Orientation (JAOO.org) [1] entitled "Perl 6, The Good Parts".  This
talk will be given to an audience of mostly Java, Python and Ruby
programmers with a smattering of XP & Agile methodology folks and OO and
Pattern gurus.  It will try to convince them of two things:

 Perl 6 is not a joke (anymore).
 
 The Perl 6 language, design and implementation contains
 revolutionary ideas that you should pay attention to.

I've been trying to pick out what parts of Perl 6 would make a Java
programmer sit up and go "I wish I had that" or a Python programmer think
"Hmm, maybe there is more than one way to do it" and, in fine Perl
tradition, a few things which make the whole audience go "what a bunch of
fruitcakes!"

Here's what I've got:

Parrot
Both as our answer to the JVM and .Net and that the language design
and the coding of the internals are going on simultaneously.

Topicalizers
Perl 5 has "Do What I Mean", Perl 6 will have "Ya Know What I Mean".
A language which understands the concept of "it".

Community Funding
A programming community with employees.  $200,000 raised so far.

Community Design
The sometimes rocky process of design by community.

Closures, Continuations, Currying, Everything Is An Object, Multimethod
Dispatch, Slots, Introspection...
Sure, other languages have these features, but all together in one 
language?

Attributes
Transcending mere objects and classes, Perl 6 introduces adverbs.


Parrot, Funding and Design are pretty straightforward to explain to an
audience of Java programmers.  For the rest, I'm asking for help placing the
proper spin on it.  Topicalizers will be particularly tricky to explain
without making it just sound like an opportunity to write more
incomprehensible Perl code.

I'm also trying to think of more bits to throw in.  Particularly in terms of
the OO system, this being a conference about OO.  From what I've heard so
far, Perl 6's OO system will be largely playing catch up with other
languages.  Hopefully the Cabal [2] can debunk that.  What will Perl 6's
class system offer that will impress a Java programmer?


[1] I was invited to speak there last year by mistake and liked it so much
I'm trying to weasel my way in again.

[2] Of which there is none.

-- 
This sig file temporarily out of order.



Re: Perl 6, The Good Parts Version

2002-07-03 Thread Trey Harris

In a message dated Wed, 3 Jul 2002, Michael G Schwern writes:
 Attributes
 Transcending mere objects and classes, Perl 6 introduces adverbs.

<confused> Attributes are adjectives, not adverbs.  Aren't they?

Trey




Re: Perl 6, The Good Parts Version

2002-07-03 Thread Dave Mitchell

On Wed, Jul 03, 2002 at 01:23:24PM -0400, Michael G Schwern wrote:
  Hopefully the Cabal [2] can debunk that.
[snip]
 [2] Of which there is none.

and http://www.perlcabal.com/ doesn't exist, right? ;-)

-- 
I do not resent criticism, even when, for the sake of emphasis,
it parts for the time with reality.
Winston Churchill, House of Commons, 22nd Jan 1941.



Re: Perl 6, The Good Parts Version

2002-07-03 Thread Tim Bunce

On Wed, Jul 03, 2002 at 01:23:24PM -0400, Michael G Schwern wrote:
 
 I'm also trying to think of more bits to throw in.  Particularly in terms of
 the OO system, this being a conference about OO.  From what I've heard so
 far, Perl 6's OO system will be largely playing catch up with other
 languages.

Don't forget Apocalypse 5.

Personally I believe the elegant and thorough integration of regular
expressions and backtracking into the large-scale logic of an
application is one of the most radical things about Perl 6.

Tim.




Re: Perl 6, The Good Parts Version

2002-07-03 Thread Janek Schleicher

Trey Harris wrote at Wed, 03 Jul 2002 19:44:45 +0200:

 In a message dated Wed, 3 Jul 2002, Michael G Schwern writes:
 Attributes
 Transcending mere objects and classes, Perl 6 introduces adverbs.
 
 confused Attributes are adjectives, not adverbs.  Aren't they?

Attributes describe the behaviour of sub routines, I think.
As a sub routine is a Doing word - a verb, 
I would say an attribute can be an adverb.

Of course an attribute of a variable 
is an adjective from this point of view :-)

I think it's a possible point of view, as there are 
many natural languages (e.g. German) that don't 
care a lot about the difference between adverbs and adjectives.


Cheerio,
Janek



Re: Perl 6, The Good Parts Version

2002-07-03 Thread Dan Sugalski

At 9:20 PM +0100 7/3/02, Dave Mitchell wrote:
On Wed, Jul 03, 2002 at 01:23:24PM -0400, Michael G Schwern wrote:
   Hopefully the Cabal [2] can debunk that.
[snip]
  [2] Of which there is none.

and http://www.perlcabal.com/ doesn't exist, right? ;-)

Of course not. Otherwise it wouldn't 404, now would it? ;-P
-- 
 Dan

--it's like this---
Dan Sugalski  even samurai
[EMAIL PROTECTED] have teddy bears and even
   teddy bears get drunk



Re: Perl 6, The Good Parts Version

2002-07-03 Thread Michael G Schwern

On Wed, Jul 03, 2002 at 09:20:01PM +0100, Dave Mitchell wrote:
 On Wed, Jul 03, 2002 at 01:23:24PM -0400, Michael G Schwern wrote:
   Hopefully the Cabal [2] can debunk that.
 [snip]
  [2] Of which there is none.
 
 and http://www.perlcabal.com/ doesn't exist, right? ;-)

  Not Found
  The requested URL / was not found on this server.
  TINPC/1.3.14 Server at www.perlcabal.com Port 80

*snort* :)


-- 
This sig file temporarily out of order.



Re: Perl 6, The Good Parts Version

2002-07-03 Thread Larry Wall

On Wed, 3 Jul 2002, Janek Schleicher wrote:
: Trey Harris wrote at Wed, 03 Jul 2002 19:44:45 +0200:
: 
:  In a message dated Wed, 3 Jul 2002, Michael G Schwern writes:
:  Attributes
:  Transcending mere objects and classes, Perl 6 introduces adverbs.
:  
:  confused Attributes are adjectives, not adverbs.  Aren't they?
: 
: Attributes describe the behaviour of sub routines, I think.
: As a sub routine is a Doing word - a verb, 
: I would say an attribute can be an adverb.

When a sub is declared, it's just an object, and so any properties
you apply to it at that point really are functioning as adjectives.

Perl 6 will support adverbs, but that's just a way to pass additional
arguments to something like the range operator.  It really does modify
the operation, not the operator.  And it's syntactically distinguished
from adjectives.

Admittedly the concepts mush together in many natural languages.

But please don't continue to call the adjectives attributes.
They're properties now.  We're reserving the term attribute
for object instance variables.  That is, attributes are formally
defined per-class, whereas properties are defined per-object on an
ad hoc basis.  It will reduce confusion if we can keep those terms
straight.  It was a mistake to call what Perl 5 has attributes,
because that's a standard industry term for instance variables.

Larry




Re: Perl 6, The Good Parts Version

2002-07-03 Thread Tim Bunce

On Wed, Jul 03, 2002 at 05:13:01PM -0400, Michael G Schwern wrote:
 On Wed, Jul 03, 2002 at 09:20:01PM +0100, Dave Mitchell wrote:
  On Wed, Jul 03, 2002 at 01:23:24PM -0400, Michael G Schwern wrote:
Hopefully the Cabal [2] can debunk that.
  [snip]
   [2] Of which there is none.
  
  and http://www.perlcabal.com/ doesn't exist, right? ;-)
 
   Not Found
   The requested URL / was not found on this server.
   TINPC/1.3.14 Server at www.perlcabal.com Port 80
 
 *snort* :)

Odd how that text isn't what it seems...

Tim.