Re: Stacks, registers, and bytecode. (Oh, my!)
Larry Wall [EMAIL PROTECTED] wrote: It may certainly be valuable to (not) think of it that way, but just don't be surprised if the regex folks come along and borrow a lot of your opcodes to make things that look like (in C):

    while (s < send && isdigit(*s)) s++;

This is the bit that scares me about unifying perl ops and regex ops: I see perl ops as relatively heavyweight things that can absorb the costs of 'heavyweight' dispatch (function call overhead, etc etc), while regex stuff needs to be very lightweight, eg

    while (op = *optr++) {
        switch (op) {
        case FOO: while (s < send && isdigit(*s)) s++; break;
        case BAR: while (s < send && isspace(*s)) s++; break;
        }
    }

Can we really unify them without taking a performance hit?
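For contrast with the lightweight switch above, a minimal sketch of function-pointer dispatch, the 'heavyweight' kind that perl-level ops can absorb -- all names here are hypothetical, not actual perl internals:

    #include <stddef.h>

    typedef struct interp interp;
    typedef void (*opfunc)(interp *);

    struct interp {
        opfunc *pc;   /* stream of op function pointers; NULL terminates */
    };

    void run(interp *i) {
        while (*i->pc != NULL)
            (*i->pc++)(i);   /* one indirect call per op: the dispatch cost
                                that a tight regex loop can't afford */
    }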
Re: Stacks, registers, and bytecode. (Oh, my!)
On Tue, Jun 05, 2001 at 11:25:09AM +0100, Dave Mitchell wrote: This is the bit that scares me about unifying perl ops and regex ops: can we really unify them without taking a performance hit? Coupl'a things: firstly, we can make Perl 6 ops as lightweight as we like. Second, Ruby uses a giant switch instead of function pointers for their op despatch loop; Matz says it doesn't make that much difference in terms of performance. I don't know if I've mentioned this before, but http://www-6.ibm.com/jp/developerworks/linux/001027/ruby_qa.html was my interview with Matsumoto about his ideas for Perl 6 and his experiences from Ruby. It's in Japanese, so http://www.excite.co.jp/world/url/ may help. -- Familiarity breeds facility. -- Megahal (trained on asr), 1998-11-06
Re: Stacks, registers, and bytecode. (Oh, my!)
Simon Cozens [EMAIL PROTECTED] opined: On Tue, Jun 05, 2001 at 11:25:09AM +0100, Dave Mitchell wrote: This is the bit that scares me about unifying perl ops and regex ops: can we really unify them without taking a performance hit? Coupl'a things: firstly, we can make Perl 6 ops as lightweight as we like. Second, Ruby uses a giant switch instead of function pointers for their op despatch loop; Matz says it doesn't make that much difference in terms of performance.

I think it would be very messy to have both types of ops in the same dispatch loop. I'd much rather have a 'regex start' opcode which calls a separate dispatch loop function, and which then interprets any further ops in the bytestream as regex ops. That way we double the number of 8-bit ops, and can have all the regex-specific state variables (s, send etc in the earlier example) and logic separated out.

I don't know if I've mentioned this before, but http://www-6.ibm.com/jp/developerworks/linux/001027/ruby_qa.html was my interview with Matsumoto about his ideas for Perl 6 and his experiences from Ruby. It's in Japanese, so http://www.excite.co.jp/world/url/ may help.

A talk of jewelry Perl developer From Mr. Simon Cozens to Ruby developer It is also as a pine It dies and is the question and reply to Mr. [squiggle] :-)
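A minimal sketch of that two-loop arrangement, with the nested loop owning the regex-only state -- op names and numbering invented purely for illustration:

    #include <ctype.h>

    enum { REGEX_END, RX_DIGITS, RX_SPACES };   /* hypothetical regex ops */

    /* The main loop hits a 'regex start' op and hands the rest of the
     * bytestream here; s and send never leak into the main loop. */
    unsigned char *run_regex(unsigned char *optr, char *s, char *send) {
        int op;
        while ((op = *optr++) != REGEX_END) {
            switch (op) {
            case RX_DIGITS: while (s < send && isdigit((unsigned char)*s)) s++; break;
            case RX_SPACES: while (s < send && isspace((unsigned char)*s)) s++; break;
            }
        }
        return optr;   /* main dispatch resumes with its own 8-bit op table */
    }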
Re: PDD 2nd go: Conventions and Guidelines for Perl Source Code
On Tue, 5 Jun 2001, Hugo wrote: I'd also like to see a specification for indentation when breaking long lines. Fwiw, the style that I prefer is:

    someFunc( really_long_param_1,
              (long_parm2 || parm3),
              really_long_other_param
    );

or, for really complex expressions:

    (    really_long_param_1
      && (parm1 || long_parm1)
      && (    yet_another_long_param
           && parm2
           && (long_parm2 || parm3)
         )
    );

Putting the final close paren on the next line makes it easier to tell where the (sub)expression finishes. Dave
Re: Stacks, registers, and bytecode. (Oh, my!)
On Tue, 5 Jun 2001, Dave Mitchell wrote: dispatch loop. I'd much rather have a 'regex start' opcode which calls a separate dispatch loop function, and which then interprets any further ops in the bytestream as regex ops. That way we double the number of 8-bit ops, and can have all the regex-specific state variables (s, send etc in the earlier example) and logic separated out.

This is an interesting idea... could we use this more generally to multiply our number of opcodes? Basically, you have one set of opcodes for (e.g.) string parsing, one set for math, etc, all of which share the same numeric values. Then you have a set of opcodes that tells the interpreter which opcode table to look in. The 'switching' opcodes then become overhead, but if there aren't too many of those, perhaps it's acceptable. And it would mean that we could specialize the opcodes a great deal more (if, of course, that is desirable), and still have them fit in an octet. (Sorry if this is a stupid question, but please be patient; I've never done internals stuff before.) Dave
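A speculative sketch of that table-switching idea, with made-up table names and switching-op values:

    typedef void (*opfunc)(void);

    /* Hypothetical 256-entry tables, one per opcode family,
     * filled in elsewhere. */
    extern opfunc math_ops[256], string_ops[256], regex_ops[256];

    void run(const unsigned char *pc) {
        opfunc *table = math_ops;                 /* default family */
        for (;;) {
            unsigned char op = *pc++;
            switch (op) {
            case 0x00: return;                    /* HALT */
            case 0xFD: table = math_ops;   break; /* the 'switching' ops: */
            case 0xFE: table = string_ops; break; /* pure overhead, but */
            case 0xFF: table = regex_ops;  break; /* cheap if they're rare */
            default:   table[op]();        break; /* one octet, many meanings */
            }
        }
    }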
Should we care much about this Unicode-ish criticism?
Courtesy of Slashdot, http://www.hastingsresearch.com/net/04-unicode-limitations.shtml I'm not sure if this is an issue for us or not, as we're generally language-neutral, and I don't see any technical issues with any of the UTF-* encodings having headroom problems. It does argue for abstracting out the string handling code a bit so it can be replaced without completely rebuilding perl, but I'm not sure that it's that strong an argument. (Though it would be nice to upgrade perl from Unicode 3.1 to 3.2 with the equivalent of a module upgrade rather than a full rebuild) Dan --it's like this--- Dan Sugalski even samurai [EMAIL PROTECTED] have teddy bears and even teddy bears get drunk
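A speculative sketch of the abstraction Dan suggests -- all names hypothetical -- routing character-set knowledge through one replaceable table, so a Unicode 3.1 to 3.2 update swaps a module instead of rebuilding perl:

    #include <stddef.h>

    typedef struct {
        const char *name;                            /* e.g. "Unicode 3.1" */
        size_t (*char_len)(const unsigned char *p);  /* bytes in char at p */
        int    (*to_upper)(int c);
        int    (*is_alpha)(int c);
    } charset_ops;

    extern charset_ops unicode_3_1;                  /* from a loadable module */
    charset_ops *current_charset = &unicode_3_1;     /* swap to upgrade */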
Re: PDD 2nd go: Conventions and Guidelines for Perl Source Code
On Tue, 29 May 2001 18:25:45 +0100 (BST), Dave Mitchell wrote: diffs:

    -KR style for indenting control constructs
    +KR style for indenting control constructs: ie the closing C<}> should
    +line up with the opening C<if> etc.

On Wed, 30 May 2001 10:37:06 -0400, Dan Sugalski wrote: I realize that no matter what style we choose, there will be a good crop of people who won't be thrilled with it. (For the record, you can count me as one, if that makes anyone feel any better :)

That's inevitable. If you have a diff/patching suite that falls over whitespace, you have a problem with diff, not with style. One can always do a pretty-print cleanup of the code before doing the diff, if all else fails. IMO this is not worth bickering over. -- Bart.
Re: Should we care much about this Unicode-ish criticism?
At 06:22 PM 6/5/2001 +0100, Simon Cozens wrote: On Tue, Jun 05, 2001 at 10:17:08AM -0700, Russ Allbery wrote: Is it just me, or does this entire article reduce not to Unicode doesn't work but Unicode should assign more characters? Yes. And Unicode has assigned more characters; it's factually challenged.

The other issue it actively brought up was the complaint about having to share glyphs amongst several languages, which didn't strike me as all that big a deal either, except perhaps as a matter of national pride and/or easy identification of the language of origin for a glyph. Not being literate in any of the languages in question, though, I didn't feel particularly qualified to make a judgement as to the validity of the complaints.

It does bring up a deeper issue, however. Unicode is, at the moment, apparently inadequate to represent at least some part of the Asian languages. Are the encodings currently in use less inadequate? I've been assuming that an Anything-Unicode translation will be lossless, but this makes me wonder whether that assumption is correct. Dan --it's like this--- Dan Sugalski even samurai [EMAIL PROTECTED] have teddy bears and even teddy bears get drunk
RE: Stacks, registers, and bytecode. (Oh, my!)
On Tue, Jun 05, 2001 at 11:25:09AM +0100, Dave Mitchell wrote: This is the bit that scares me about unifying perl ops and regex ops: can we really unify them without taking a performance hit? Coupl'a things: firstly, we can make Perl 6 ops as lightweight as we like. Second, Ruby uses a giant switch instead of function pointers for their op despatch loop; Matz says it doesn't make that much difference in terms of performance.

Function pointer dispatch is normally as fast as, or faster than, a switch. The main downside is the context. A typical regular expression engine can pre-fetch many variables into register locals, which can be used efficiently by all the switch cases. However, the common context for regular expressions is relatively small, so I am not sure about the performance hit. Hong
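Hong's point, sketched with hypothetical op numbers: inside one big switch the compiler can keep s and send in registers across every case, while function-pointer ops must carry the same state through a context structure in memory:

    #include <ctype.h>

    /* One big switch: s and send can stay in registers throughout. */
    void match_switch(const unsigned char *ops, char *s, char *send) {
        int op;
        while ((op = *ops++) != 0)
            switch (op) {
            case 1: while (s < send && isdigit((unsigned char)*s)) s++; break;
            case 2: while (s < send && isspace((unsigned char)*s)) s++; break;
            }
    }

    /* Function pointers: each op loads and stores through the context. */
    typedef struct { char *s, *send; } rx_ctx;

    void op_digits(rx_ctx *c) {
        while (c->s < c->send && isdigit((unsigned char)*c->s)) c->s++;
    }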
Re: Should we care much about this Unicode-ish criticism?
On Tue, Jun 05, 2001 at 01:31:38PM -0400, Dan Sugalski wrote: The other issue it actively brought up was the complaint about having to share glyphs amongst several languages, which didn't strike me as all that big a deal either, except perhaps as a matter of national pride and/or easy identification of the language of origin for a glyph. Not being literate in any of the languages in question, though, I didn't feel particularly qualified to make a judgement as to the validity of the complaints.

There are a number of related problems here; the Han unification effort has pissed off some Asians on several counts. The easiest part to explain is display; this isn't something that Perl particularly needs to care about, but the same glyph may need to look different if it's in Chinese rather than in Japanese. For the rest, I refer the assembly to my undergraduate dissertation :) :

Unicode itself is, like the JIS standard, simply an enumeration of characters with their orderings; it says nothing about how the data is represented to the computer, and must be supplemented by one of several Unicode Transformation Formats which describe the encoding. However, despite the huge benefits to programmers worldwide, two critical problems are hindering the adoption of Unicode amongst the Japanese computer-using community. The first objection is technical, and the second is more sociological.

The technical objection stems from the fact that the Unicode Consortium initially assigned a finite space for all Japanese, Chinese and Korean characters, allowing only just under 28,000 characters. This space has nearly been filled, with 20,902 basic characters already accepted, and 6,585 new characters under review; the situation is not going to get any better as Chinese characters are invented for use in proper names and so on. It is evident that 28,000 characters is not going to be anywhere near enough, and programmers have felt betrayed that the promise of a `fully Universal character set' will satisfy all other languages but theirs. Thankfully, the Unicode Consortium has recently assigned another extension plane for CJK characters and adopted a further 42,711 characters, meaning that all the characters in the Chinese Han Yu Da Zidian and the Japanese Morohashi Dai Kanwa Jiten are now adopted into Unicode. However, many programmers are unaware of the extension plane and still feel that the Unicode Consortium is ignoring their plight.

More serious, however, is the decision to unify equivalent characters in the Chinese, Japanese and Korean character sets into a single table known as `Unihan'[10]. This has proved controversial primarily through lack of understanding of the nature of `equivalent characters': the Unihan table does not constitute a dumbing down of the character set, as simplified and traditional forms of characters have been maintained. However, Chinese and Japanese variants of the same single character have been unified. The Unicode standard seeks to encode characters rather than glyphs[11], and hence the variant characters which come about due to variations in writing style have been unified. On the other hand, characters undergoing structural variance have not been unified. The principles on which Han Unification took place are, according to [Graham, 2000], not dissimilar to those used to unify characters in the legacy JIS and other character sets.
Three rules were used to determine whether or not two kanji should be considered equivalent:

Source Separation Rule: If two kanji were distinct in a primary source character set (JIS in the case of Japanese, GB2312-80 and other GB standards for Chinese, KSC5601-1987 for Korean, and so on) then they should not be unified. This would allow round-trip conversion between Unicode and the original source. For instance, the following variants of the character for tsurugi, sword, were not unified: [Picture omitted]

Non-Cognate Rule: Kanji which are not cognate are not variants; this prohibits, for instance, the unification of the following characters: [Picture omitted]

Component Structure: If a unification is acceptable under the above rules, unification is only carried out if the characters share the same radicals and component features, taking into consideration their arrangement.

Using these rules, the CJK Joint Research Group of the ISO technical committee on Unicode reduced a candidate set of 121,000 Han characters to 20,902 unique characters [12]. On the other hand, there are some valid objections from the Japanese, on three specific counts [13]:

Firstly, the JIS standard defines, along with the ordering and enumeration of its characters, their glyph shape. Unicode, on the other hand, does not. This means that as far as Unicode is concerned, there is literally no distinction between two distinct shapes and hence no way to specify which should be used. This becomes particularly emotive when one is, for instance, attempting to
RE: Should we care much about this Unicode-ish criticism?
Courtesy of Slashdot, http://www.hastingsresearch.com/net/04-unicode-limitations.shtml I'm not sure if this is an issue for us or not, as we're generally language-neutral, and I don't see any technical issues with any of the UTF-* encodings having headroom problems.

I think the author confused himself. Unicode itself is not sufficient to process human language, no matter how many characters it includes. It is just an encoding. Just take Chinese as an example: only a small percentage (10%) of Chinese can read more than 6000 characters. The biggest dictionary I know of includes about 65000 characters, many of which even linguists cannot agree on. Some of the characters are essentially the research results of the dictionary's authors. It is impossible to include those characters in an international standard such as Unicode.

Unicode contains surrogates for future growth. We still have about 1M code points left for allocation. Eventually it will include many more characters than anyone can care about. Hong
Re: Stacks, registers, and bytecode. (Oh, my!)
On Mon, Jun 04, 2001 at 06:04:10PM -0700, Larry Wall wrote: Well, other languages have explored that option, and I think that makes for an unnatural interface. If you think of regexes as part of a larger language, you really want them to be as incestuous as possible, just as any other part of the language is incestuous with the rest of the language. That's part of what I mean when I say that I'm trying to look at regular expressions as just a strange variant of Perl code. Looking at it from a slightly different angle, regular expressions are in great part control syntax, and library interfaces are lousy at implementing control. Right. Having the regex opcodes be perl opcodes will certainly make implementing (?{ ... }) much easier and probably faster too. Also re references that we have now will become similar to subroutines for pattern matching. I think there are a lot of benefits to the re engine not to be separate from the core perl ops. Graham.
RE: Should we care much about this Unicode-ish criticism?
Firstly, the JIS standard defines, along with the ordering and enumeration of its characters, their glyph shape. Unicode, on the other hand, does not. This means that as far as Unicode is concerned, there is literally no distinction between two distinct shapes and hence no way to specify which should be used. This becomes particularly emotive when one is, for instance, attempting to represent a person's name - if they have a particular preferred variant character with which they write their name, there is no way to communicate that to the computer, and information is lost.

This is very common, nothing surprising. As you can tell, my name is "hong zhang", which has already lost the "Chinese tone" and "glyph". "hong" has 4 tones, each tone can be any of several characters, and each character can be one of several glyphs (simplified and traditional). However, it does not really matter; it is still my name.

The second objection is again related to character versus glyph issues: since Chinese,

I think this problem =~ locale. For any Unicode character, you cannot properly tell its lowercase or uppercase form without considering locale. And Unicode does not encode locale.

Finally, there is a historiographical issue; when computers are used to digitise and store historical literature containing archaic characters, specifying the exact variant character becomes an important consideration.

I believe this should be handled by the application. This kind of work is needed for research; Perl should not care about it. Hong
Re: Should we care much about this Unicode-ish criticism?
On 05 Jun 2001 11:07:11 -0700, Russ Allbery wrote: Particularly since part of his contention is that 16 bits isn't enough, and I think all the widely used national character sets are no more than 16 bits, aren't they?

It's not really important. UTF-8 is NOT limited to 16 bits (3 bytes). With 4 bytes, UTF-8 can represent 20-bit characters, i.e. 6 times more than the desired number of 17. See http://czyborra.com/utf/#UTF-8 for how this is done.

And the major flaw that I see in the acceptance of Unicode is that Unicode text files are not ASCII compatible. UTF-8 files are. That makes for a very nice upgrade path. -- Bart.
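For reference, a sketch of the classic UTF-8 encoder, truncated at four bytes (RFC 2279 defines forms up to six); four bytes carry 21 payload bits, enough for the full 17-plane range:

    /* Encode one code point as UTF-8; returns the byte count. */
    int utf8_encode(unsigned long c, unsigned char *buf) {
        if (c < 0x80)  { buf[0] = (unsigned char)c; return 1; }
        if (c < 0x800) { buf[0] = (unsigned char)(0xC0 | (c >> 6));
                         buf[1] = (unsigned char)(0x80 | (c & 0x3F)); return 2; }
        if (c < 0x10000) {
            buf[0] = (unsigned char)(0xE0 | (c >> 12));
            buf[1] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
            buf[2] = (unsigned char)(0x80 | (c & 0x3F)); return 3;
        }
        /* Four bytes reach 0x1FFFFF, comfortably past 0x10FFFF. */
        buf[0] = (unsigned char)(0xF0 | (c >> 18));
        buf[1] = (unsigned char)(0x80 | ((c >> 12) & 0x3F));
        buf[2] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
        buf[3] = (unsigned char)(0x80 | (c & 0x3F));
        return 4;
    }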
Re: Should we care much about this Unicode-ish criticism?
On Tue, Jun 05, 2001 at 09:16:05PM +0200, Bart Lateur wrote: Unicode text files No such animal. Unicode's a character repertoire, not an encoding. See you at my Unicode tutorial at TPC? :) -- buf[hdr[0]] = 0;/* unbelievably lazy ken (twit) */ - Andrew Hume
RE: Should we care much about this Unicode-ish criticism?
At 11:18 AM 6/5/2001 -0700, Hong Zhang wrote: Firstly, the JIS standard defines, along with the ordering and enumeration of its characters, their glyph shape. Unicode, on the other hand, does not. This means that as far as Unicode is concerned, there is literally no distinction between two distinct shapes and hence no way to specify which should be used. This becomes particularly emotive when one is, for instance, attempting to represent a person's name - if they have a particular preferred variant character with which they write their name, there is no way to communicate that to the computer, and information is lost. This is very common, nothing surprising. As you can tell, my name is hong zhang, which has already lost the Chinese tone and glyph. hong has 4 tones, each tone can be any of several characters, each character can be one of several glyphs (simplified and traditional). However, it does not really matter; it is still my name.

I dunno. It's one thing to have a word represented with non-native characters--loss is expected. It's quite another to have it spelled out in an encoding that's supposed to preserve such things and have it not actually do that. That'd be like having my name spelled or pronounced differently because it was encoded in Unicode instead of ASCII. That's just plain wrong.

The second objection is again related to character versus glyph issues: since Chinese, I think this problem =~ locale. For any Unicode character, you cannot properly tell its lowercase or uppercase form without considering locale. And Unicode does not encode locale.

Yeah, that is a problem. The alternative isn't any better, unfortunately. Human languages are a pain. :) We're going to need case-translation stuff for perl 6, I think, if lc, uc, and their ilk are going to work properly.

Finally, there is a historiographical issue; when computers are used to digitise and store historical literature containing archaic characters, specifying the exact variant character becomes an important consideration. I believe this should be handled by the application. This kind of work is needed for research; Perl should not care about it.

I think I'd agree there. Different versions of a glyph are more a matter of art and handwriting styles, and that's not really something we ought to get involved in. The European equivalent would be to have many versions of A, so we could represent the different ways it was drawn in various illuminated manuscripts. That seems rather excessive. Dan --it's like this--- Dan Sugalski even samurai [EMAIL PROTECTED] have teddy bears and even teddy bears get drunk
Re: Should we care much about this Unicode-ish criticism?
At 12:40 PM 6/5/2001 -0700, Russ Allbery wrote: Bart Lateur [EMAIL PROTECTED] writes: UTF-8 is NOT limited to 16 bits (3 bytes). That's an odd definition of byte you have there. :)

Maybe it's RAD50. :) Still, it may take 3 bytes to represent in UTF-8 a character that takes 2 bytes in UTF-16.

With 4 bytes, UTF-8 can represent 20-bit characters, i.e. 6 times more than the desired number of 17. UTF-8 is a mapping from a 31-bit (yes, not 32, interestingly enough) character numbering, and as such can represent over two billion characters. For some reason that I've never understood, the Unicode folks are limiting that to only a subset of what one can do with 31 bits by putting an artificial limit on how high a character value they're willing to assign, but even with that, as soon as they started using the higher planes, there's easily enough space to add every character the author mentioned and then some.

Yeah, the limitations are kind of odd. I'm presuming they're in there so the technical folks have at least some sort of a stick to smack the crankier non-technical folks with.

(As an aside, UTF-8 also is not an X-byte encoding; UTF-8 is a variable byte encoding, with each character taking up anywhere from one to six bytes in the encoded form depending on where in Unicode the character falls.)

Have they changed that again? Last I checked, UTF-8 was capped at 4 bytes, but that's in the Unicode 3.0 standard. Dan --it's like this--- Dan Sugalski even samurai [EMAIL PROTECTED] have teddy bears and even teddy bears get drunk
Re: Stacks, registers, and bytecode. (Oh, my!)
Graham Barr wrote: I think there are a lot of benefits to the re engine not to be separate from the core perl ops. So does it start with a split(//,$bound_thing) or does it use substr(...) with explicit offsets?
Re: Stacks, registers, and bytecode. (Oh, my!)
On Tue, Jun 05, 2001 at 03:31:24PM -0500, David L. Nicol wrote: Graham Barr wrote: I think there are a lot of benefits to the re engine not to be separate from the core perl ops. So does it start with a split(//,$bound_thing) or does it use substr(...) with explicit offsets? Eh? Nobody is suggesting we implement re's using the current set of perl ops, but that we extend the set with ops needed for re's, so that they use the same dispatch loop and the ops can be intermixed. Graham.
Re: Should we care much about this Unicode-ish criticism?
On Tuesday 05 June 2001 03:24 pm, Dan Sugalski wrote: The second objection is again related to character versus glyph issues: since Chinese, I think this problem =~ locale. For any Unicode character, you cannot properly tell its lowercase or uppercase form without considering locale. And Unicode does not encode locale. Yeah, that is a problem. The alternative isn't any better, unfortunately. Human languages are a pain. :) We're going to need case-translation stuff for perl 6, I think, if lc, uc, and their ilk are going to work properly.

Yes, we've discussed this off and on for various things - character class identification, sorting, comparison, case-translation. Where do you draw the line, lines, and/or default line? I'd like Perl to be able to handle textual information, and not just do character manipulation, but that doesn't mean at the core level.

Some additional stuff to ponder over, and maybe Unicode addresses these - I haven't been able to read *all* the Unicode stuff yet. (And, yes, Simon, you will see me in class.) Some languages don't have upper or lower case. Are tests and translations on caseless characters true or false? (Or undefined?) Should the same Unicode character, when used in two different languages, be string equivalent?

Asciibetical order is one thing, as it (roughly) maps alphabetical order for English. But unless you've been blessed with a root language for Unicode mapping (such as Arabic), Unicodical sorting is going to be non-sensical, as you hop between your language variants and the characters encoded somewhere else (as in Farsi). And, of course, there are several different orderings for eastern glyph languages, IINM.

But I think it'd be too heavy to make Perl inherently locale-aware. The best, I think, would be to have Perl simply be Unicode neutral - to treat the characters (with any equivalencies, etc) as just data - and to allow locale modules to replace or supplement the ops/functions/* that *are* locale aware. That would allow all the locale-specific handling code to be written/debugged/distributed separately from the core on its own timeframe. It would ultimately lead to a little more consistency, since everyone can use a common handler instead of rolling their own. No need to have locale handlers for locales you won't use.

Of course, being Unicode neutral, that still leaves some stuff (like case determination) undefined. So maybe there should be a default locale in place - the current, or barring that, English, I suppose. -- Bryan C. Warnock [EMAIL PROTECTED]
Re: Should we care much about this Unicode-ish criticism?
On Tue, Jun 05, 2001 at 05:39:36PM -0400, Bryan C . Warnock wrote: Some languages don't have upper or lower case. Are tests and translations on caseless characters true or false? (Or undefined?) I'd say undefined.

Should the same Unicode character, when used in two different languages, be string equivalent? YES. Definitely. Same Unicode character, same thing. You wanted something else, use a different Unicode character.

Asciibetical order is one thing, as it (roughly) maps alphabetical order for English. But unless you've been blessed with a root language for Unicode mapping (such as Arabic), Unicodical sorting is going to be non-sensical, as you hop between your language variants and the characters encoded somewhere else (as in Farsi). And, of course, there are several different orderings for eastern glyph languages, IINM. Not our problem. There are collation sequences within the various subsets, and these'll work fine if we go by UTR#10. If you ask for a non-sensical comparison between two different languages, you'll get one.

But I think it'd be too heavy to make Perl inherently locale-aware. The best, I think, would be to have Perl simply be Unicode neutral - to treat the characters (with any equivalencies, etc) as just data Strongly agree.

That would allow all the locale-specific handling code to be written/debugged/distributed separately from the core on its own timeframe. Strongly agree.

Of course, being Unicode neutral, that still leaves some stuff (like case determination) undefined. So maybe there should be a default locale in place - the current, or barring that, English, I suppose. Default to ASCII-ish and make it very, very easy for locale handling modules to override the various pieces of the puzzle. -- It can be hard to tell an English bigot from a monoglot with an inferiority complex, but one cannot tell a Welshman any thing a tall. - Geraint Jones.
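A minimal sketch of the override mechanism Simon suggests -- hypothetical names, ASCII-ish defaults that a locale module can replace:

    #include <ctype.h>
    #include <string.h>

    struct locale_hooks {
        int (*to_upper)(int c);                        /* case translation */
        int (*compare)(const char *a, const char *b);  /* collation */
    };

    /* ASCII-ish defaults from the C library. */
    static struct locale_hooks hooks = { toupper, strcmp };

    /* A locale module calls this to install its own versions. */
    void set_locale_hooks(const struct locale_hooks *h) {
        hooks = *h;
    }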
Re: Should we care much about this Unicode-ish criticism?
Dan Sugalski [EMAIL PROTECTED] writes: At 12:40 PM 6/5/2001 -0700, Russ Allbery wrote: (As an aside, UTF-8 also is not an X-byte encoding; UTF-8 is a variable byte encoding, with each character taking up anywhere from one to six bytes in the encoded form depending on where in Unicode the character falls.) Have they changed that again? Last I checked, UTF-8 was capped at 4 bytes, but that's in the Unicode 3.0 standard. Yes, it changed with Unicode 3.1 when they started allocating characters from higher planes. Far and away the best reference for UTF-8 that I've found is RFC 2279. It's much more concise and readable than the version in the Unicode standard, and is more aimed at implementors and practical considerations. -- Russ Allbery ([EMAIL PROTECTED]) http://www.eyrie.org/~eagle/
Re: Should we care much about this Unicode-ish criticism?
Bryan C Warnock [EMAIL PROTECTED] writes: Some additional stuff to ponder over, and maybe Unicode addresses these - I haven't been able to read *all* the Unicode stuff yet. (And, yes, Simon, you will see me in class.) Some languages don't have upper or lower case. Are tests and translations on caseless characters true or false? (Or undefined?) Caseless characters should be guaranteed unchanged by conversion to upper or lower case, IMO. Case is a normative property of characters in Unicode, so case mappings should actually be pretty well-defined. Note that there are actually three cases in Unicode, upper, lower, and title case, since there are some characters that require the third distinction (stuff like Dz is generally used as an example). Should the same Unicode character, when used in two different languages, be string equivalent? The way to start solving this whole problem is probably through normalization; Unicode defines two separate normalizations, one of which collapses more similar characters than the other. One is designed to preserve formatting information while the other loses formatting information. (The best example of how they differ is that one leaves the ffi ligature alone and the other breaks it down into three separate characters.) Perl should allow programmers to choose their preferred normalization schemes or none at all. (There are really four normalization schemes; in two of them, you leave things fully decomposed, and in the other two you recompose characters as much as possible.) -- Russ Allbery ([EMAIL PROTECTED]) http://www.eyrie.org/~eagle/
Re: Should we care much about this Unicode-ish criticism?
On Tuesday 05 June 2001 05:49 pm, Simon Cozens wrote: YES. Definitely. Same Unicode character, same thing. You wanted something else, use a different Unicode character.

I don't understand. There *is* only one character. I can't choose another. Take 0x0648, for instance. It's both waw, the 27th letter of the Arabic alphabet, and veh, the 30th letter of the Persian alphabet, which aren't the same letter. Same character, different letters. Equivalent, or different? In Unicode, or locale-independent terms, they're the same; I've no problem with that. Within one locale or the other, I'm not so sure. I think it needs to be able to go both ways, with equivalence perhaps being the default.

(Perhaps this need only be so simple as to be able to tag and query (via attributes, for instance) the language of the string, and handle the logic yourself. If the languages differ, no sense in comparing, yadda yadda yadda. Then again, whether it is a difference or not may also be a language issue. I'd be inclined to think that waw and veh are different, but Gift (in English) and Gift (in German) are the same. To me, those are the same characters and same letters (even though, I guess, technically they are not), with just different meanings.)

In either case (or perhaps it is an extension of the same case), each locale should be able to specify and handle its own determination of equivalency. As I watch everyone talk about the eastern languages, and I think of the middle eastern languages, I realize what a mess this potentially is. For the most part, Hong is right - it's for the applications to handle. But I think that we need to have a clear understanding of what we're asking the applications to handle, in an effort to make the hard things easy. (And for some of these languages, it can be quite hard.) -- Bryan C. Warnock [EMAIL PROTECTED]
Re: Should we care much about this Unicode-ish criticism?
Simon Cozens [EMAIL PROTECTED] writes: On Tue, Jun 05, 2001 at 03:27:03PM -0700, Russ Allbery wrote: Caseless characters should be guaranteed unchanged by conversion to upper or lower case, IMO. I think Bryan's asking more about \p{IsUpper} than uc(). Ahh... well, Unicode classifies them for us, yes? Lowercase, Uppercase, Titlecase, and Other, IIRC. So a caseless character wouldn't show up in either IsLower or IsUpper. -- Russ Allbery ([EMAIL PROTECTED]) http://www.eyrie.org/~eagle/
RE: Should we care much about this Unicode-ish criticism?
The problem, as I see it, is not that the mechanism can't handle the languages; it is that the Latin/Gothic countries chose first, and gave what's left to the Oriental countries. This is evident in the Musical Symbols and even Byzantine Musical Symbols. Are these character sets more important than the actual language character sets being denied to the other countries? Are musical and mathematical symbols even a language at all? Yes, I understand that they are in the sense that they convey information, but if Unicode is only trying to generically represent common-use language, then some of the characters (perhaps sets) should go. And if we go the other way and say that this is intended to represent every sort of written, spoken, or symbolic communication, then it really opens up the floodgates (I need a character for the men's room sign, please).

Here are some questions for English speakers to ask themselves about Unicode: Are the original ASCII graphical characters somehow more worthy of inclusion than the Chinese characters? Aren't Unicode 0xBD (the one-half character) and 1/2 the same? When was the last time that you saw the cent sign on a computer? When was the last time that you saw the cent sign anywhere?

It seems to me that Unicode, in its present form, although a valiant attempt, is just a 'better' ASCII, and not a complete solution. Grant M.
Re: Should we care much about this Unicode-ish criticism?
NeonEdge [EMAIL PROTECTED] writes: This is evident in the Musical Symbols and even Byzantine Musical Symbols. Are these character sets more important than the actual language character sets being denied to the other countries? Are musical and mathematical symbols even a language at all? At the same time as 246 Byzantine Musical Symbols and 219 Musical Symbols were added, 43,253 Asian language ideographs were added. I fail to see the problem. Musical and mathematical symbols are certainly used more frequently than ancient Han ideographs that have been obsolete for 2,000 years, and it's not like the ideographs are having major difficulties being added to Unicode either. If the author of the original paper referred to here thinks there are still significant characters missing from Unicode, he should stop whining about it and put together a researched proposal. That's what the Byzantine music researchers did, and as a result their characters have now been added. This is how standardization works. You have to actually go do the work; you can't just complain and expect someone else to do it for you. In the meantime, the normally-encountered working character set of modern Asian languages has been in Unicode from the beginning, and currently the older and rarer characters and the characters used these days only in proper names are being backfilled at a rate of tens of thousands per Unicode revision. How this can then be described as ignoring Asian languages boggles me beyond words. There are a lot of characters. It takes time. Rome wasn't built in a day. It seems to me that Unicode, in it's present form, although a valiant attempt, is just a 'better' ascii, and not a complete solution. It seems to me that you haven't bothered to go look at what Unicode is actually doing. -- Russ Allbery ([EMAIL PROTECTED]) http://www.eyrie.org/~eagle/
Re: Should we care much about this Unicode-ish criticism?
Dan Sugalski writes:
: Have they changed that again? Last I checked, UTF-8 was capped at 4 bytes,
: but that's in the Unicode 3.0 standard.

Doesn't really matter where they install the artificial cap, because for philosophical reasons Perl is gonna support larger values anyway. It's just that 4 bytes of UTF-8 happens to be large enough to represent anything UTF-16 can represent with surrogates. So they refuse to believe in anything longer than 4 bytes, even though the representation can be extended much further. (Perl 5 extends it all the way to 64-bit values, represented in 13 bytes!)

They also arbitrarily define UTF-32 to not use higher values than 0x10FFFF, but that doesn't mean we're gonna send in the high-bit Nazis if people want higher values for their own purposes. But since the names UTF-8 and UTF-32 are becoming associated with those arbitrary restrictions, it's getting even more important to refer to Perl's looser style as utf8 (and, potentially, utf32). I don't know if Perl will have a utf16 that is distinguished from UTF-16. Larry
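The arithmetic behind those figures, sketched: an n-byte UTF-8 sequence (n = 2..6) carries 5n+1 payload bits, so the standard scheme tops out at 31 bits. The 13-byte figure for 64-bit values is from Larry's note above; the exact byte layout of perl 5's extension is assumed here, not shown:

    /* Bytes needed to hold a value of the given bit width. */
    int utf8ish_len(int bits) {
        int n;
        if (bits <= 7) return 1;               /* 0xxxxxxx */
        for (n = 2; n <= 6; n++)
            if (bits <= 5 * n + 1) return n;   /* 11->2, 16->3, 21->4, ... 31->6 */
        return 13;                             /* perl 5's loose utf8, per Larry */
    }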
Re: Should we care much about this Unicode-ish criticism?
Larry Wall [EMAIL PROTECTED] writes: Doesn't really matter where they install the artificial cap, because for philosophical reasons Perl is gonna support larger values anyway. It's just that 4 bytes of UTF-8 happens to be large enough to represent anything UTF-16 can represent with surrogates. So they refuse to believe in anything longer than 4 bytes, even though the representation can be extended much further. (Perl 5 extends it all the way to 64-bit values, represented in 13 bytes!) That's probably unnecessary; I really don't expect them to ever use all 31 bytes that the IETF-standardized version of UTF-8 supports. I don't know if Perl will have a utf16 that is distinguised from UTF-16. I wouldn't bother spending any time on UTF-16 beyond basic support for converting away from it. It combines the worst of both worlds, and I don't expect it to be used much now that they've buried the idea of keeping Unicode to 16 bits. -- Russ Allbery ([EMAIL PROTECTED]) http://www.eyrie.org/~eagle/
Re: Should we care much about this Unicode-ish criticism?
Russ Allbery [EMAIL PROTECTED] writes: That's probably unnecessary; I really don't expect them to ever use all 31 bytes that the IETF-standardized version of UTF-8 supports. 31 bits, rather. *sigh* But given that, modulo some debate over CJKV, we're getting into *really* obscure stuff already at only 94,140 characters, I'm guessing that there would have to be some really major and fundamental changes in written human communication before something more than two billion characters are used. Which doesn't mean rule out the possibility of ever expanding, since one should always leave that option open, but expending coding effort on it isn't worth it. Particularly since extending UTF-8 to more than 31 bits requires breaking some of the guarantees that UTF-8 makes, unless I'm missing how you're encoding the first byte so as not to give it a value of 0xFE. -- Russ Allbery ([EMAIL PROTECTED]) http://www.eyrie.org/~eagle/
Re: Should we care much about this Unicode-ish criticism?
On Tue, Jun 05, 2001 at 04:44:46PM -0700, Russ Allbery wrote: In the meantime, the normally-encountered working character set of modern Asian languages has been in Unicode from the beginning, and currently the older and rarer characters and the characters used these days only in proper names are being backfilled at a rate of tens of thousands per Unicode revision. How this can then be described as ignoring Asian languages boggles me beyond words. There are a lot of characters. It takes time. Rome wasn't built in a day. Also, remember what I wrote earlier: all the characters in the Chinese Han Yu Da Zidian and the Japanese Morohashi Dai Kanwa Jiten are now adopted into Unicode. -- If you do not wish your beer to be served without the traditional head, please ask for a top-up. With the subtext: Your traditional head will then exit via the traditional window. Arsehole. - Mark Dickerson
Re: Should we care much about this Unicode-ish criticism?
On Tue, Jun 05, 2001 at 04:44:46PM -0700, Russ Allbery wrote: NeonEdge [EMAIL PROTECTED] writes: This is evident in the Musical Symbols and even Byzantine Musical Symbols. Are these character sets more important than the actual language character sets being denied to the other countries? Are musical and mathematical symbols even a language at all? At the same time as 246 Byzantine Musical Symbols and 219 Musical Symbols were added, 43,253 Asian language ideographs were added. I fail to see the problem. Musical and mathematical symbols are certainly used more frequently than ancient Han ideographs that have been obsolete for 2,000 years, and it's not like the ideographs are having major difficulties being added to Unicode either. If the author of the original paper referred to here thinks there are still significant characters missing from Unicode, he should stop whining about it and put together a researched proposal. That's what the Byzantine music researchers did, and as a result their characters have now been added. This is how standardization works. You have to actually go do the work; you can't just complain and expect someone else to do it for you.

(as a lurker in the unicode list ([EMAIL PROTECTED]), which also had the link to the opinion under discussion posted in there) Exactly. As another data point, once in a while in the list someone asks what about Egyptian hieroglyphics, Unicode can't be all-encompassing, nyahnyahnyah? Well, there the situation is that there *is* slowly ongoing work between the egyptologists and the Unicode people to get all the stork-atop-a-hippo-facing-left encoded; it's just that the egyptologists themselves have a hard time agreeing what actually would be the canonical set of glyphs. There is a process for getting more characters into Unicode, but the Unicode people cannot be experts in all possible scripts. No proposals, no encodings.

Another constant source of confusion (which is at least part of the Asian discontent) is that Unicode encodes abstract characters, not any particular rendering (fonts). (There are some exceptions to this, but they are mainly there to guarantee a safe round-trip to Unicode and back for legacy characters.) For example, bold-a is the same as italic-a is the same as plain-a. The same principle was behind the Han unification. Sometimes it would be preferable to decompose characters to be more flexible and future-proof. For example, the number of codepoints for Han could be dramatically reduced if there were an agreed-upon way to electronically decompose the glyphs to radicals -- but it seems (I am not an expert on this, mind) that there isn't, and we have to deal with dozens of thousands of them. -- $jhi++; # http://www.iki.fi/jhi/ # There is this special biologist word we use for 'stable'. # It is 'dead'. -- Jack Cohen
Re: Should we care much about this Unicode-ish criticism?
At 04:44 PM 6/5/2001 -0700, Larry Wall wrote: Dan Sugalski writes: : Have they changed that again? Last I checked, UTF-8 was capped at 4 bytes, : but that's in the Unicode 3.0 standard. Doesn't really matter where they install the artificial cap, because for philosophical reasons Perl is gonna support larger values anyway. It's just that 4 bytes of UTF-8 happens to be large enough to represent anything UTF-16 can represent with surrogates. So they refuse to believe in anything longer than 4 bytes, even though the representation can be extended much further. (Perl 5 extends it all the way to 64-bit values, represented in 13 bytes!)

I know we can, but is it really a good idea? 32 bits is really stretching it for character encoding, and 64 seems rather excessive. Really space-wasteful as well, if we maintain a character type with a fixed width large enough to hold the largest decoded variable-width character. And I really, *really* want to do as little as possible internally with variable-width encodings. Yech.

They also arbitrarily define UTF-32 to not use higher values than 0x10FFFF, but that doesn't mean we're gonna send in the high-bit Nazis if people want higher values for their own purposes.

Well, that'd be inappropriate since a good chunk of the rest of the set's been dedicated to future expansion. I think it might be a reasonable idea for -w to grumble if someone's used a character in the unassigned range, though. (IIRC there's a piece set aside for folks to do whatever they want with)

But since the names UTF-8 and UTF-32 are becoming associated with those arbitrary restrictions, it's getting even more important to refer to Perl's looser style as utf8 (and, potentially, utf32). I don't know if Perl will have a utf16 that is distinguished from UTF-16.

I'd as soon not do UTF-16 at all, or at least no more than we need to convert to UTF-32 or UTF-8. Dan --it's like this--- Dan Sugalski even samurai [EMAIL PROTECTED] have teddy bears and even teddy bears get drunk
Re: Stacks, registers, and bytecode. (Oh, my!)
At 07:40 AM 6/5/2001 -0700, Dave Storrs wrote: On Tue, 5 Jun 2001, Dave Mitchell wrote: dispatch loop. I'd much rather have a 'regex start' opcode which calls a separate dispatch loop function, and which then interprets any further ops in the bytestream as regex ops. That way we double the number of 8-bit ops, and can have all the regex-specific state variables (s, send etc in the earlier example) and logic separated out. This is an interesting idea... could we use this more generally to multiply our number of opcodes? Basically, you have one set of opcodes for (e.g.) string parsing, one set for math, etc, all of which have the same value. Then you have a set of opcodes that tells the interpreter which opcode table to look in.

Nah, that's too much work. We just allow folks to define their own opcode functions and assign each a lexically unique number, and dispatch to the function as appropriate. Adding and overriding opcodes is definitely in the cards, though in most cases it'll probably be an opcode version of a function call, since machine-level stuff would also require telling the compiler how to emit those opcodes. (Which folks writing python/ruby/rebol/cobol/fortran front ends for the interpreter might do) Dan --it's like this--- Dan Sugalski even samurai [EMAIL PROTECTED] have teddy bears and even teddy bears get drunk
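A sketch of that registration scheme, with hypothetical names, reserved range, and table size:

    typedef void (*opfunc)(void *interp);

    #define MAX_OPS 65536
    static opfunc op_table[MAX_OPS];
    static int    next_op = 1024;      /* say 0..1023 are reserved for core */

    /* A front end registers its op function and gets back the unique
     * number its compiler should emit for that op. */
    int register_op(opfunc f) {
        if (next_op >= MAX_OPS)
            return -1;                 /* table full */
        op_table[next_op] = f;
        return next_op++;
    }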
Re: Should we care much about this Unicode-ish criticism?
Russ Allbery writes: : Particularly since extending UTF-8 to more : than 31 bits requires breaking some of the guarantees that UTF-8 makes, : unless I'm missing how you're encoding the first byte so as not to give it : a value of 0xFE. The UTF-16 BOMs, 0xFEFF and 0xFFFE, both turn out to be illegal UTF-8 in any case, so it doesn't much matter, assuming BOMs are used on UTF-16 that has to be auto-distinguished from UTF-8. (Doing any kind of auto-recognition on 16-bit data without BOMs is problematic in any case.) Larry
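A sketch of the auto-recognition Larry describes: the BOM's two serialized byte orders are both illegal as UTF-8, so a two-byte peek is enough (function and enum names invented here):

    enum guess { GUESS_UTF16_BE, GUESS_UTF16_LE, GUESS_UTF8ISH };

    enum guess sniff_bom(const unsigned char *p, int len) {
        if (len >= 2 && p[0] == 0xFE && p[1] == 0xFF) return GUESS_UTF16_BE;
        if (len >= 2 && p[0] == 0xFF && p[1] == 0xFE) return GUESS_UTF16_LE;
        return GUESS_UTF8ISH;   /* no BOM: treat as UTF-8; bare 16-bit
                                   data can't be reliably recognized */
    }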
Re: Should we care much about this Unicode-ish criticism?
Dan Sugalski writes: : At 04:44 PM 6/5/2001 -0700, Larry Wall wrote: : (Perl 5 extends it all the way to 64-bit values, represented in 13 bytes!) : : I know we can, but is it really a good idea? 32 bits is really stretching : it for character encoding, and 64 seems rather excessive. Such large values would not typically be used for standard characters, but as a means of embedding an inline chunk of non-character data, such as a pointer, or a set of metadata bits. : Really : space-wasteful as well, if we maintain a character type with a fixed width : large enough to hold the largest decoded variable-width character. True 'nuff. I suspect most people would want to stick within 32 bits, which is sufficiently wasteful for most purposes. : And I : really, *really* want to do as little as possible internally with : variable-width encodings. Yech. Mmm, the difficulty of that is overrated. Very seldom do you want to do anything other than find the next character, or the previous character, and those are pretty easy to do in utf8. : They also arbitrarily define UTF-32 to not use higher values than : 0x10, but that doesn't mean we're gonna send in the high-bit Nazis : if people want higher values for their own purposes. : : Well, that'd be inappropriate since a good chunk of the rest of the set's : been dedicated to future expansion. I think it might be a reasonable idea : for -w to grumble if someone's used a character in the unassigned range, : though. (IIRC there's a piece set aside for folks to do whatever they want : with) Certainly, but it's easy to come up with reasons to want to stuff more bits inline than the private use areas will support. Rather than have -w grumble about such characters, I'd rather see an optional output discipline that enforces strict Unicode output. : But since the names UTF-8 and UTF-32 are becoming associated with those : arbitrary restrictions, it's getting even more important to refer to : Perl's looser style as utf8 (and, potentially, utf32). I don't know : if Perl will have a utf16 that is distinguised from UTF-16. : : I'd as soon not do UTF-16 at all, or at least no more than we need to : convert to UTF-32 or UTF-8. Well, as you pointed out above, we might not use any kind of UTF internally, but just arrays of properly sized integers, which are never variable length. (UTF-32 is the only UTF that's not a variable-length encoding.) On the other hand, maybe there's some use for a data structure that is a sequence of integers of various sizes, where the representation of different chunks of the array/string might be different sizes. Would make some aspects of copy-on-write more efficient to be able to chunk strings and integer arrays. And of course this would all be transparent at the language level, in the absence of explicit syntax to treat an array as a string or a string as an array. Larry
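Larry's "next character, previous character" point, sketched: UTF-8 continuation bytes all match 10xxxxxx, so stepping either way is a short scan (this assumes a well-formed buffer with room to move in the given direction):

    const unsigned char *utf8_next(const unsigned char *p) {
        p++;
        while ((*p & 0xC0) == 0x80) p++;   /* skip continuation bytes */
        return p;
    }

    const unsigned char *utf8_prev(const unsigned char *p) {
        p--;
        while ((*p & 0xC0) == 0x80) p--;   /* back up over continuation bytes */
        return p;
    }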
Re: Should we care much about this Unicode-ish criticism?
Larry Wall [EMAIL PROTECTED] writes: Russ Allbery writes: Particularly since extending UTF-8 to more than 31 bits requires breaking some of the guarantees that UTF-8 makes, unless I'm missing how you're encoding the first byte so as not to give it a value of 0xFE. The UTF-16 BOMs, 0xFEFF and 0xFFFE, both turn out to be illegal UTF-8 in any case, so it doesn't much matter, assuming BOMs are used on UTF-16 that has to be auto-distinguished from UTF-8. (Doing any kind of auto-recognition on 16-bit data without BOMs is problematic in any case.)

Yeah, but one of the guarantees of UTF-8 is:

    - The octet values FE and FF never appear.

I can see that this property may not be that important, but it makes me feel like things that don't have this property aren't really UTF-8. -- Russ Allbery ([EMAIL PROTECTED]) http://www.eyrie.org/~eagle/