Re: [EXPERIMENTAL] 1st draft of Encode

2000-09-14 Thread Dominic Dunlop
At 18:00 +0200 2000-09-13, Philip Newton wrote: >What's Perl's take on characters where ord($c) > 0x, anyway? It seems to Just Work, as this one-ish-liner shows: % perl -we '$s.=chr(16**$_-1) for(1..9); \ printf "%#10x\n", ord($t) while $t=substr($s,0,1,"")' 0xf 0xff 0xf

Re: [EXPERIMENTAL] 1st draft of Encode

2000-09-14 Thread Philip Newton
On 14 Sep 2000, at 12:35, Dominic Dunlop wrote: > At 18:00 +0200 2000-09-13, Philip Newton wrote: > >What's Perl's take on characters where ord($c) > 0x, anyway? > > It seems to Just Work, as this one-ish-liner shows: [snip] In that case, if we want to go switch internal encoding UTF-8, we

Re: [EXPERIMENTAL] 1st draft of Encode

2000-09-13 Thread Philip Newton
On 12 Sep 2000, at 11:57, Jarkko Hietaniemi wrote: > I would go for UCS-2 (UTF-16) as soon as possible as the preferred > internal encoding. You know, of course, that UCS-2 ne UTF-16 (specifically, surrogates). What's Perl's take on characters where ord($c) > 0x, anyway? (These two issues

Re: [EXPERIMENTAL] 1st draft of Encode

2000-09-13 Thread Jarkko Hietaniemi
On Wed, Sep 13, 2000 at 06:00:55PM +0200, Philip Newton wrote: > On 12 Sep 2000, at 11:57, Jarkko Hietaniemi wrote: > > > I would go for UCS-2 (UTF-16) as soon as possible as the preferred > > internal encoding. > > You know, of course, that UCS-2 ne UTF-16 (specifically, surrogates). Surroga

Re: [EXPERIMENTAL] 1st draft of Encode

2000-09-13 Thread Nick Ing-Simmons
Simon Cozens <[EMAIL PROTECTED]> writes: >On Tue, Sep 12, 2000 at 11:24:47AM +0200, Gisle Aas wrote: >> Nick Ing-Simmons <[EMAIL PROTECTED]> writes: >> >> > My stab at names would be: >> > >> > utf8bytes_to_chars() >> > >> > chars_to_utf8bytes(); >> >> That works for me. > >That

Re: [EXPERIMENTAL] 1st draft of Encode

2000-09-12 Thread Ed Batutis
Jarkko: >> >> It might be more useful if the default for the non-utf-8 characters >> were the system-defined default character encoding of the process ... > >I can understand the request but the problem is that for this to work >the legacy eight-bit mappings must first be implemented. ... I un

Re: [EXPERIMENTAL] 1st draft of Encode

2000-09-12 Thread Jarkko Hietaniemi
> To me the UTF8 flag is just a trick to improve the performance of Ahh, I'd say all the UTF-8 encoding and decoding is, if anything, degrading our performance. UTF-8 is space-saving for US-ASCII, that's about the only redeeming feature of it, for all the other character ranges it is wasteful i

Re: [EXPERIMENTAL] 1st draft of Encode

2000-09-12 Thread Jarkko Hietaniemi
> Being paranoid is a good way to expose bugs. > > If we after some perl operation end up with a string where the UTF8 > flag is turned on and where the bytes do not represent a properly > encoded sequence, then _that_ is a bug and should be fixed. > True or false? True. But this principle shou

Re: [EXPERIMENTAL] 1st draft of Encode

2000-09-12 Thread Jarkko Hietaniemi
On Tue, Sep 12, 2000 at 02:37:02PM +0100, Simon Cozens wrote: > On Tue, Sep 12, 2000 at 08:19:39AM -0500, Jarkko Hietaniemi wrote: > > The biggest problem is that the ICU will not be everywhere. > > Unicode will not be everywhere either. By my mighty mathemagical magick I can prove that the inte

Re: [EXPERIMENTAL] 1st draft of Encode

2000-09-12 Thread Simon Cozens
On Tue, Sep 12, 2000 at 08:19:39AM -0500, Jarkko Hietaniemi wrote: > The biggest problem is that the ICU will not be everywhere. Unicode will not be everywhere either. -- TANSTAAFL

Re: [EXPERIMENTAL] 1st draft of Encode

2000-09-12 Thread Jarkko Hietaniemi
> If I say to_utf8() once more on it I expect to get a string containing > 4 chars: > > "\xC3\x82\xC2\xA9" > > and I expect from_utf8() to go the other way. Ahhh, no wonder we differed. My brain equates to_utf8(to_utf8()) with to_utf8(). > Your to_utf8() seems to be named after "turn-on-the

Re: [EXPERIMENTAL] 1st draft of Encode

2000-09-12 Thread Jarkko Hietaniemi
On Mon, Sep 11, 2000 at 08:25:37PM -0700, Ed Batutis wrote: > >Please read Encode.pm. Mainly I'm interested hearing comments whether > >this is a good interface... > > I like the interface. No complicated options. > > It might be more useful if the default for the non-utf-8 characters > were t

Re: [EXPERIMENTAL] 1st draft of Encode

2000-09-12 Thread Graham Barr
On Tue, Sep 12, 2000 at 11:17:50AM +0100, Simon Cozens wrote: > On Tue, Sep 12, 2000 at 11:24:47AM +0200, Gisle Aas wrote: > > Nick Ing-Simmons <[EMAIL PROTECTED]> writes: > > > > > My stab at names would be: > > > > > > utf8bytes_to_chars() > > > > > > chars_to_utf8bytes(); > >

Re: [EXPERIMENTAL] 1st draft of Encode

2000-09-12 Thread Simon Cozens
On Tue, Sep 12, 2000 at 11:24:47AM +0200, Gisle Aas wrote: > Nick Ing-Simmons <[EMAIL PROTECTED]> writes: > > > My stab at names would be: > > > > utf8bytes_to_chars() > > > > chars_to_utf8bytes(); > > That works for me. That screams of getting the user *WAY* too involved with t

Re: [EXPERIMENTAL] 1st draft of Encode

2000-09-12 Thread Gisle Aas
Nick Ing-Simmons <[EMAIL PROTECTED]> writes: > My stab at names would be: > > utf8bytes_to_chars() > > chars_to_utf8bytes(); That works for me. Regards, Gisle

Re: [EXPERIMENTAL] 1st draft of Encode

2000-09-12 Thread Gisle Aas
Jarkko Hietaniemi <[EMAIL PROTECTED]> writes: > > The is_utf8() also seems wrong to me. I believe that the SV invariant > > should be that a string marked with the UTF8 flag should not contain > > illegal UTF8 sequences. Why is it not so? > > I'm being paranoid. Keeps me alive. Being paranoid

Re: [EXPERIMENTAL] 1st draft of Encode

2000-09-12 Thread Gisle Aas
Jarkko Hietaniemi <[EMAIL PROTECTED]> writes: > > I would like to see these convert perl strings to bytes: > > > > to_utf8 > > > > And these convert a sequence of bytes to perl strings: > > > > from_utf8 > > > > You seem to want to define these functions the opposite way. Perhaps > > the

Re: [EXPERIMENTAL] 1st draft of Encode

2000-09-12 Thread Graham Barr
On Mon, Sep 11, 2000 at 05:09:22PM -0500, Jarkko Hietaniemi wrote: > > 1. Is there any chance of a null mapping to convert a string containing > > UTF8 but not marked to one so marked, and vice versa? > > "Define 'containing UTF8'. This string contains UTF8." > > I purposefully left that one o

Re: [EXPERIMENTAL] 1st draft of Encode

2000-09-12 Thread Nick Ing-Simmons
Gisle Aas <[EMAIL PROTECTED]> writes: >Jarkko Hietaniemi <[EMAIL PROTECTED]> writes: > >> Please take a look at the (very rough) first draft of Encode, an extension >> for character encoding conversions for Perl 5: >> >> http://www.iki.fi/jhi/Encode.tgz >> >> Download, plop it into the Perl

Re: [EXPERIMENTAL] 1st draft of Encode

2000-09-12 Thread Ed Batutis
>Please read Encode.pm. Mainly I'm interested hearing comments whether >this is a good interface... I like the interface. No complicated options. It might be more useful if the default for the non-utf-8 characters were the system-defined default character encoding of the process -rather than

Re: [EXPERIMENTAL] 1st draft of Encode

2000-09-11 Thread Jarkko Hietaniemi
On Mon, Sep 11, 2000 at 09:08:05PM +0100, Markus Kuhn wrote: > Jarkko Hietaniemi wrote on 2000-09-11 19:43 UTC: > > > I've been working on the other end of it, which is the conversion to and from > > > other character sets - basically, the plan is to derive the data from the > > > > There's also

Re: [EXPERIMENTAL] 1st draft of Encode

2000-09-11 Thread Jarkko Hietaniemi
> I would like to see these convert perl strings to bytes: > > to_utf8 > > And these convert a sequence of bytes to perl strings: > > from_utf8 > > You seem to want to define these functions the opposite way. Perhaps > the names are just too confusing. Even on second reading I do not foll

Re: [EXPERIMENTAL] 1st draft of Encode

2000-09-11 Thread Jarkko Hietaniemi
On Tue, Sep 12, 2000 at 12:24:50AM +0200, Gisle Aas wrote: > Jarkko Hietaniemi <[EMAIL PROTECTED]> writes: > > > Please take a look at the (very rough) first draft of Encode, an extension > > for character encoding conversions for Perl 5: > > > > http://www.iki.fi/jhi/Encode.tgz > > > > Dow

Re: [EXPERIMENTAL] 1st draft of Encode

2000-09-11 Thread Gisle Aas
Jarkko Hietaniemi <[EMAIL PROTECTED]> writes: > Please take a look at the (very rough) first draft of Encode, an extension > for character encoding conversions for Perl 5: > > http://www.iki.fi/jhi/Encode.tgz > > Download, plop it into the Perl 5.7 source directory, unpack, > re-Configure

Re: [EXPERIMENTAL] 1st draft of Encode

2000-09-11 Thread Jarkko Hietaniemi
> A propos, I think in-place variants of to_utf8() and from_utf8() > might be in order. utf8_on() and utf8_off(), possibly. Maybe move > some of the functions to utf8.c, sv_cvtpv_to_utf8(), and > sv_cvtpv_from_utf8(), newSVpv_utf8(), newSVsv_to_utf8(), > newSVsv_from_utf8(), maybe. ...and some

Re: [EXPERIMENTAL] 1st draft of Encode

2000-09-11 Thread Jarkko Hietaniemi
On Mon, Sep 11, 2000 at 11:21:42PM +0100, Simon Cozens wrote: > On Mon, Sep 11, 2000 at 05:17:21PM -0500, Jarkko Hietaniemi wrote: > > Hmmm, okay. I guess I need to start writing the utf8_on(), then. > > Please see > http://www.xray.mpe.mpg.de/mailing-lists/perl-unicode/2000-09/msg0.html >

Re: [EXPERIMENTAL] 1st draft of Encode

2000-09-11 Thread Simon Cozens
On Mon, Sep 11, 2000 at 05:17:21PM -0500, Jarkko Hietaniemi wrote: > Hmmm, okay. I guess I need to start writing the utf8_on(), then. Please see http://www.xray.mpe.mpg.de/mailing-lists/perl-unicode/2000-09/msg0.html which contains UTF8::Hack, which does this. -- Pray to God, but keep row

Re: [EXPERIMENTAL] 1st draft of Encode

2000-09-11 Thread Jarkko Hietaniemi
On Mon, Sep 11, 2000 at 11:13:12PM +0100, Simon Cozens wrote: > On Mon, Sep 11, 2000 at 05:09:22PM -0500, Jarkko Hietaniemi wrote: > > > 1. Is there any chance of a null mapping to convert a string containing > > > UTF8 but not marked to one so marked, and vice versa? > > > > "Define 'containing

Re: [EXPERIMENTAL] 1st draft of Encode

2000-09-11 Thread Simon Cozens
On Mon, Sep 11, 2000 at 05:09:22PM -0500, Jarkko Hietaniemi wrote: > > 1. Is there any chance of a null mapping to convert a string containing > > UTF8 but not marked to one so marked, and vice versa? > > "Define 'containing UTF8'. This string contains UTF8." A string for which could_be_utf8 r

Re: [EXPERIMENTAL] 1st draft of Encode

2000-09-11 Thread Jarkko Hietaniemi
> Later, when we've got IO disciplines, we should be able to mark > an input handle to in utf8. s/ to / to be /; > A propos, I think in-place variants of to_utf8(0 and from_utf8() s/0/)/; -- $jhi++; # http://www.iki.fi/jhi/ # There is this special biologist word we use for 'stable'.

Re: [EXPERIMENTAL] 1st draft of Encode

2000-09-11 Thread Jarkko Hietaniemi
On Mon, Sep 11, 2000 at 04:52:33PM -0500, [EMAIL PROTECTED] wrote: > >There isn't anything to test this with (I did say 'very rough'). Please > read > >Encode.pm. Mainly I'm interested hearing comments whether this is a good > > >interface, something that could be used to replace Unicode::Map

Re: [EXPERIMENTAL] 1st draft of Encode

2000-09-11 Thread Martin_Hosken
>There isn't anything to test this with (I did say 'very rough'). Please read >Encode.pm. Mainly I'm interested hearing comments whether this is a good >interface, something that could be used to replace Unicode::Map8 (lots of >table-driven conversions, for 8-bit legacy character sets), and

Re: [EXPERIMENTAL] 1st draft of Encode

2000-09-11 Thread Simon Cozens
On Mon, Sep 11, 2000 at 03:11:51PM -0500, Jarkko Hietaniemi wrote: > > Note: The Unicode mapping tables on > > > > http://www.unicode.org/Public/MAPPINGS/ > > > > are generally better reviewed and more up to date than these two > > alternative sources of conversion tables. Their format is also

Re: [EXPERIMENTAL] 1st draft of Encode

2000-09-11 Thread Jarkko Hietaniemi
On Mon, Sep 11, 2000 at 09:08:05PM +0100, Markus Kuhn wrote: > Jarkko Hietaniemi wrote on 2000-09-11 19:43 UTC: > > > I've been working on the other end of it, which is the conversion to and from > > > other character sets - basically, the plan is to derive the data from the > > > > There's also

Re: [EXPERIMENTAL] 1st draft of Encode

2000-09-11 Thread Markus Kuhn
Jarkko Hietaniemi wrote on 2000-09-11 19:43 UTC: > > I've been working on the other end of it, which is the conversion to and from > > other character sets - basically, the plan is to derive the data from the > > There's also RFC 1345, and > http://anubis.dkuug.dk/cultreg/registrations/chreg.htm

Re: [EXPERIMENTAL] 1st draft of Encode

2000-09-11 Thread Simon Cozens
On Mon, Sep 11, 2000 at 02:13:33PM -0500, Jarkko Hietaniemi wrote: > There isn't anything to test this with (I did say 'very rough'). > Please read Encode.pm. Mainly I'm interested hearing comments whether > this is a good interface, something that could be used to replace > Unicode::Map8 (lots o

Re: [EXPERIMENTAL] 1st draft of Encode

2000-09-11 Thread Jarkko Hietaniemi
> I've been working on the other end of it, which is the conversion to and from > other character sets - basically, the plan is to derive the data from the There's also RFC 1345, and http://anubis.dkuug.dk/cultreg/registrations/chreg.htm > The legacy 8-bit stuff is trivial, and when my copy of C

[EXPERIMENTAL] 1st draft of Encode

2000-09-11 Thread Jarkko Hietaniemi
Please take a look at the (very rough) first draft of Encode, an extension for character encoding conversions for Perl 5: http://www.iki.fi/jhi/Encode.tgz Download, plop it into the Perl 5.7 source directory, unpack, re-Configure, rebuild. (Or, if you have a Perl 5.7 in your path, cd to