Re: Locales: An Analysis

2000-02-03 Thread Larry Wall
Chip Salzenberg writes:
: So: The _string_encoding_ state of each OP must be one of these:
:
: 0. the default -- follow each string's current encoding
: 1. "use byte" -- all strings are one-byte
: 2. "use utf8" -- all strings are UTF-8 (*not* necessarily Unicode!)

There is no 2.

: And t

Re: Locales: An Analysis

2000-02-04 Thread Larry Wall
Chip Salzenberg writes:
: > Not really a superset anymore, unless you're into defining your own
: > characters outside of U+10.
:
: I don't understand... Could someone point me to a description of the
: current Unicode <-> ISO 10646 relationship?

Well, http://www.unicode.org/unicode/standard

Re: Locales: An Analysis

2000-02-04 Thread Larry Wall
Russ Allbery writes:
: FWIW, from the standards front, the next revision of the news standards
: will almost certainly be standardizing on UTF-8 as the character set for
: headers (headers being particularly tricky since while you can use MIME to
: specify a character set for the body, doing the s

Re: Locales: An Analysis

2000-02-05 Thread Larry Wall
Johan Vromans writes:
: Larry Wall <[EMAIL PROTECTED]> writes:
:
: > If a subject has more than 50% high-bit characters in the subject,
: > it goes straight into my spam mailbox without trying any of the
: > other heuristics.
:
: I use 'more than 5 high-bit characters in

Re: Locales: An Analysis

2000-02-05 Thread Larry Wall
Bart Schuller writes:
: On Fri, Feb 04, 2000 at 09:21:04AM -0800, Tim Bray wrote:
: > It should be noted that over in Java-land, UTF-16 is more or less the
: > native dialect, and UTF-8 is a royal pain in the butt to deal with. Sigh.
:
: I was just reading up on the Java Native Interface and the

Re: Locales: An Analysis

2000-02-04 Thread Larry Wall
Tim Bray writes:
: BTW, should ord($c) return different values depending on whether or not
: I've said "use utf8;"?

The short answer is no. The medium answer is that you'll have to say "use byte" if you want ord($c) to return the first byte rather than the first character. The long answer is

Re: Locales: An Analysis

2000-02-04 Thread Larry Wall
Ilya Zakharevich writes:
: On Fri, Feb 04, 2000 at 12:12:25AM -0800, Chip Salzenberg wrote:
: > > > So: The _string_encoding_ state of each OP must be one of these:
: > > > 0. the default -- follow each string's current encoding
: > > > 1. "use byte" -- all strings are one-byte
: > > > 2. "

Re: Locales: An Analysis

2000-02-04 Thread Larry Wall
Gurusamy Sarathy writes:
: Treating literals as utf8 is a bit of a compatibility issue, but
: I think we should get around that by treating the lex input stream
: as any other discipline. IOW, default PL_rsfp to byte mode,
: and let users push a utf8/utf16/whatever discipline on it if they
: wann

Re: Locales: An Analysis

2000-02-04 Thread Larry Wall
Tom Christiansen writes:
: >Well, I hope they enforce it. We're starting to get all sorts of
: >gobbledygook in the subjects of mail messages. I'd love it if mailers
: >rejected messages whose headers contain illegal UTF-8 sequences.
:
: That's not too hard to do. :-)

Technologically, yes. Bu
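
As a rough illustration of the two header checks mentioned in this thread (rejecting headers that contain illegal UTF-8 sequences, and the spam heuristic based on the share of high-bit characters in a subject), here is a minimal Python sketch. It is not code from any of the messages. The function names is_valid_utf8 and high_bit_ratio are hypothetical, and the sketch assumes the header is available as raw bytes and counts high-bit bytes rather than decoded characters.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if raw is a well-formed UTF-8 byte sequence."""
    try:
        # Python's strict decoder rejects truncated, overlong, and
        # otherwise malformed sequences.
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def high_bit_ratio(raw: bytes) -> float:
    """Return the fraction of bytes in raw that have the high bit set."""
    if not raw:
        return 0.0
    return sum(1 for b in raw if b >= 0x80) / len(raw)


if __name__ == "__main__":
    # Hypothetical subject line: Latin-1 bytes that are not valid UTF-8.
    subject = b"caf\xe9 sp\xe9cial"
    print(is_valid_utf8(subject))
    print(high_bit_ratio(subject) > 0.5)  # the 50% threshold quoted in the thread
```

A real mail filter would first have to undo any MIME header encoding, which this sketch does not attempt.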