Re: Locales: An Analysis

2000-02-04 Thread Gurusamy Sarathy

On Fri, 04 Feb 2000 09:52:20 PST, Larry Wall wrote:
> The long answer is that we're phasing out the experimental "use utf8"
> declaration.

The status as of 640 is that only two things are affected by
C<use utf8>: interpretation of literals/identifiers in the source text;
and how REs are compiled.  Both should go away.

Having it affect the interpretation of identifiers is a bit bogus,
since high-bit chars have never been allowed in them before, so
we could just always interpret them as utf8.

Treating literals as utf8 is a bit of a compatibility issue, but
I think we should get around that by treating the lex input stream
as any other discipline.  IOW, default PL_rsfp to byte mode,
and let users push a utf8/utf16/whatever discipline on it if they
wanna.  (This would apply to identifiers as well.)
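
To make the "byte mode by default, push a discipline if you want one" idea
concrete, here is a minimal sketch in Python rather than in perl's own C
internals. The function name open_source and its discipline parameter are
hypothetical, invented for illustration; only io.BytesIO and
io.TextIOWrapper are real library calls. It is not how PL_rsfp is handled
in perl.

```python
import io

def open_source(raw, discipline=None):
    """Return a line reader over the bytes in `raw`.

    With no discipline the reader stays in byte mode and yields bytes.
    With a discipline name such as "utf-8" or "utf-16", a decoding
    layer is pushed on top of the byte stream and the reader yields str.
    """
    stream = io.BytesIO(raw)          # the default: plain bytes
    if discipline is None:
        return stream
    # Push a decoding layer; newline="" leaves line endings untouched.
    return io.TextIOWrapper(stream, encoding=discipline, newline="")

src = 'my $s = "caf\u00e9";\n'.encode("utf-8")
print(open_source(src).readline())            # bytes, high-bit chars untouched
print(open_source(src, "utf-8").readline())   # decoded text
```

The point of the layering is that the lexer never needs to know which
encoding was chosen; it only sees whatever the topmost layer hands it.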

Converting the RE code to compile down to polymorphic ops still needs a
bit of work, by my reckoning.  Ilya, you hearing me?  :-)


Sarathy



Re: Locales: An Analysis

2000-02-04 Thread Larry Wall

Gurusamy Sarathy writes:
: Treating literals as utf8 is a bit of a compatibility issue, but
: I think we should get around that by treating the lex input stream
: as any other discipline.  IOW, default PL_rsfp to byte mode,
: and let users push a utf8/utf16/whatever discipline on it if they
: wanna.  (This would apply to identifiers as well.)

They can always push a discipline explicitly, but I think we should just
auto-recognize it by default.  Start with a generic discipline that
doesn't commit, then when you see a high bit, look and see if you've
got illegal utf8.  If you do, it's not utf8.  If you don't, it is
99.99% certain (in Latin-1 countries) to be utf8, and you can look ahead
some more if you want to be more certain.  In non-Latin-1 countries
you probably have to push an explicit discipline anyway to tell it which
of many character sets you're using if you're not using utf8, so it
should still default to utf8 if it sees legal utf8.
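
The check described above can be sketched in a few lines. This is an
illustration in Python, not perl's implementation; the function name sniff
and the three labels it returns are made up for the example. It relies on
the fact that bytes.decode("utf-8") is strict by default and raises
UnicodeDecodeError on an illegal sequence.

```python
def sniff(data):
    """Classify a buffer as "ascii", "utf8", or "bytes"."""
    if all(b < 0x80 for b in data):
        return "ascii"       # no high bit seen: nothing to commit to yet
    try:
        data.decode("utf-8") # strict: rejects any illegal utf8 sequence
    except UnicodeDecodeError:
        return "bytes"       # illegal utf8, so it is not utf8
    return "utf8"            # high bit seen and every sequence is legal

print(sniff(b"print 1;"))
print(sniff("caf\u00e9".encode("utf-8")))
print(sniff("caf\u00e9".encode("latin-1")))
```

Latin-1 text rarely passes because a lone high-bit byte such as 0xE9 is a
utf8 lead byte that must be followed by continuation bytes. The rare false
positive looks like the Latin-1 pair 0xC3 0xA9, which is also the legal
utf8 encoding of U+00E9. Note that this sketch would also report "bytes"
for a buffer that was cut off in the middle of a multi-byte character.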

If you'd like to think of it in stronger terms, the script is in utf8
until proven otherwise.  Just start with the utf8 discipline by default
and make it smart enough to recover from errors by switching to an
8-bit/binary discipline if we haven't already committed too much to utf8.

This will also be a useful discipline for files of unknown provenance,
I think, but we probably wouldn't make it the default discipline for
ordinary files.  It might possibly be the default utf8 discipline,
though there might be some call for a utf8_darnit discipline that would
puke on non-utf8 rather than trying to switch.
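
A rough sketch of the two behaviours, again in Python and under stated
assumptions: read_lines and its strict flag are hypothetical names, and
Latin-1 stands in for the "8-bit/binary" fallback, which the message does
not pin down. With strict=False the reader starts in utf8 and switches on
the first illegal sequence; with strict=True it plays the part of
utf8_darnit and raises instead.

```python
def read_lines(raw, strict=False):
    """Yield decoded lines from `raw`, starting in utf8 mode.

    On the first illegal utf8 sequence: raise if strict, otherwise
    switch to Latin-1 for that line and every line after it.
    """
    mode = "utf-8"
    for line in raw.splitlines(keepends=True):
        try:
            yield line.decode(mode)
        except UnicodeDecodeError:
            if strict:
                raise            # the utf8_darnit behaviour
            mode = "latin-1"     # assumed stand-in for the 8-bit fallback
            yield line.decode(mode)

good = "a\u00e9\n".encode("utf-8")   # legal utf8
bad = b"b\xe9\n"                     # Latin-1, illegal as utf8
print(list(read_lines(good + bad)))
```

Lines already yielded as utf8 are not revisited after the switch, which is
one simple reading of "if we haven't already committed too much to utf8".
A real discipline might instead buffer and re-decode from the start.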

Larry



Re: Locales: An Analysis

2000-02-04 Thread Ilya Zakharevich

On Fri, Feb 04, 2000 at 11:21:36AM -0800, Larry Wall wrote:
> Tom Christiansen writes:
> : Well, I hope they enforce it.  We're starting to get all sorts of
> : gobbledygook in the subjects of mail messages.  I'd love it if mailers
> : rejected messages whose headers contain illegal UTF-8 sequences.
> :
> : That's not too hard to do. :-)
>
> Technologically, yes.  But culturally we have to get buy in from
> everyone who currently sends Latin-1 in headers.

Or KOI-8.  Or win-125..

Ilya