At 18:00 +0200 2000-09-13, Philip Newton wrote:
>What's Perl's take on characters where ord($c) > 0x, anyway?
It seems to Just Work, as this one-ish-liner shows:
% perl -we '$s.=chr(16**$_-1) for(1..9); \
printf "%#10x\n", ord($t) while $t=substr($s,0,1,"")'
0xf
0xff
0xf
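For readers who want to try this outside the shell, the one-liner above can be unrolled into a short script. This is only a sketch: it assumes a released Perl (5.8 or later) rather than the 5.7 development tree under discussion, it stops at 16**5 - 1 so that every value stays inside the Unicode range, and the variable names are illustrative.

```perl
use strict;
use warnings;

# Code points 0xF, 0xFF, 0xFFF, 0xFFFF, 0xFFFFF (all below 0x10FFFF).
my @expected = map { 16**$_ - 1 } 1 .. 5;

# Build one string holding a character for each code point.
my $s = join '', map { chr } @expected;

# Peel characters off the front and record their ordinals.
my @seen;
while (length $s) {
    my $c = substr($s, 0, 1, '');   # 4-argument substr removes the first character
    push @seen, ord $c;
}

# Same output format as the one-liner.
printf "%#10x\n", $_ for @seen;
```

If ord() returns the same values that went into chr(), wide characters survive the round trip through the string.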
On 14 Sep 2000, at 12:35, Dominic Dunlop wrote:
> At 18:00 +0200 2000-09-13, Philip Newton wrote:
> >What's Perl's take on characters where ord($c) > 0x, anyway?
>
> It seems to Just Work, as this one-ish-liner shows:
[snip]
In that case, if we want to go switch internal encoding UTF-8, we
On 12 Sep 2000, at 11:57, Jarkko Hietaniemi wrote:
> I would go for UCS-2 (UTF-16) as soon as possible as the preferred
> internal encoding.
You know, of course, that UCS-2 ne UTF-16 (specifically, surrogates).
What's Perl's take on characters where ord($c) > 0x, anyway?
(These two issues
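To make the surrogate point concrete: UCS-2 can only address code points up to 0xFFFF, while UTF-16 represents anything above that as a pair of 16-bit surrogates. A minimal sketch of the arithmetic follows; to_surrogates() is a hypothetical helper name used for illustration, not a function from Perl or from the Encode draft.

```perl
use strict;
use warnings;

# Split a code point above 0xFFFF into a UTF-16 surrogate pair.
# to_surrogates() is a hypothetical helper, not part of any module.
sub to_surrogates {
    my ($cp) = @_;
    die "code point must be in 0x10000..0x10FFFF\n"
        if $cp < 0x10000 || $cp > 0x10FFFF;
    my $v  = $cp - 0x10000;           # 20 bits remain
    my $hi = 0xD800 + ($v >> 10);     # high surrogate carries the top 10 bits
    my $lo = 0xDC00 + ($v & 0x3FF);   # low surrogate carries the bottom 10 bits
    return ($hi, $lo);
}

my ($hi, $lo) = to_surrogates(0x10400);
printf "U+%04X -> %04X %04X\n", 0x10400, $hi, $lo;   # prints "U+10400 -> D801 DC00"
```

Any code that treats UTF-16 data as fixed-width UCS-2 will count such a pair as two characters.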
On Wed, Sep 13, 2000 at 06:00:55PM +0200, Philip Newton wrote:
> On 12 Sep 2000, at 11:57, Jarkko Hietaniemi wrote:
>
> > I would go for UCS-2 (UTF-16) as soon as possible as the preferred
> > internal encoding.
>
> You know, of course, that UCS-2 ne UTF-16 (specifically, surrogates).
Surroga
Simon Cozens <[EMAIL PROTECTED]> writes:
>On Tue, Sep 12, 2000 at 11:24:47AM +0200, Gisle Aas wrote:
>> Nick Ing-Simmons <[EMAIL PROTECTED]> writes:
>>
>> > My stab at names would be:
>> >
>> > utf8bytes_to_chars()
>> >
>> > chars_to_utf8bytes();
>>
>> That works for me.
>
>That
Jarkko:
>>
>> It might be more useful if the default for the non-utf-8 characters
>> were the system-defined default character encoding of the process ...
>
>I can understand the request but the problem is that for this to work
>the legacy eight-bit mappings must first be implemented. ...
I un
> To me the UTF8 flag is just a trick to improve the performance of
Ahh, I'd say all the UTF-8 encoding and decoding is, if anything,
degrading our performance. UTF-8 is space-saving for US-ASCII,
that's about the only redeeming feature of it, for all the other
character ranges it is wasteful i
> Being paranoid is a good way to expose bugs.
>
> If we after some perl operation end up with a string where the UTF8
> flag is turned on and where the bytes do not represent a properly
> encoded sequence, then _that_ is a bug and should be fixed.
> True or false?
True. But this principle shou
On Tue, Sep 12, 2000 at 02:37:02PM +0100, Simon Cozens wrote:
> On Tue, Sep 12, 2000 at 08:19:39AM -0500, Jarkko Hietaniemi wrote:
> > The biggest problem is that the ICU will not be everywhere.
>
> Unicode will not be everywhere either.
By my mighty mathemagical magick I can prove that the inte
On Tue, Sep 12, 2000 at 08:19:39AM -0500, Jarkko Hietaniemi wrote:
> The biggest problem is that the ICU will not be everywhere.
Unicode will not be everywhere either.
--
TANSTAAFL
> If I say to_utf8() once more on it I expect to get a string containing
> 4 chars:
>
> "\xC3\x82\xC2\xA9"
>
> and I expect from_utf8() to go the other way.
Ahhh, no wonder we differed. My brain equates to_utf8(to_utf8()) with
to_utf8().
> Your to_utf8() seems to be named after "turn-on-the
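The two readings of to_utf8() can be compared with a small sketch. It uses utf8::encode() and utf8::decode(), which are built into later released Perls (5.8 and up) and are an assumption here: they are not the draft to_utf8() and from_utf8() being discussed. They follow the "encode again gives four characters" reading, so they are not idempotent.

```perl
use strict;
use warnings;

# Render a string as space-separated hex ordinals, one per character.
sub hex_dump { join ' ', map { sprintf '%02X', ord } split //, $_[0] }

# Start with the single character U+00A9 (copyright sign).
my $once = "\xA9";
utf8::encode($once);      # now the two bytes C2 A9

# Encoding a second time treats each byte as a Latin-1 character.
my $twice = $once;
utf8::encode($twice);     # now the four bytes C3 82 C2 A9

# Decoding undoes exactly one layer of encoding.
my $back = $twice;
utf8::decode($back);      # two characters again: C2 A9

print hex_dump($once),  "\n";   # prints "C2 A9"
print hex_dump($twice), "\n";   # prints "C3 82 C2 A9"
print hex_dump($back),  "\n";   # prints "C2 A9"
```

Under the other reading, where to_utf8(to_utf8($s)) equals to_utf8($s), the second call would have to notice that the string is already encoded and leave it alone.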
On Mon, Sep 11, 2000 at 08:25:37PM -0700, Ed Batutis wrote:
> >Please read Encode.pm. Mainly I'm interested in hearing comments whether
> >this is a good interface...
>
> I like the interface. No complicated options.
>
> It might be more useful if the default for the non-utf-8 characters
> were t
On Tue, Sep 12, 2000 at 11:17:50AM +0100, Simon Cozens wrote:
> On Tue, Sep 12, 2000 at 11:24:47AM +0200, Gisle Aas wrote:
> > Nick Ing-Simmons <[EMAIL PROTECTED]> writes:
> >
> > > My stab at names would be:
> > >
> > > utf8bytes_to_chars()
> > >
> > > chars_to_utf8bytes();
> >
On Tue, Sep 12, 2000 at 11:24:47AM +0200, Gisle Aas wrote:
> Nick Ing-Simmons <[EMAIL PROTECTED]> writes:
>
> > My stab at names would be:
> >
> > utf8bytes_to_chars()
> >
> > chars_to_utf8bytes();
>
> That works for me.
That screams of getting the user *WAY* too involved with t
Nick Ing-Simmons <[EMAIL PROTECTED]> writes:
> My stab at names would be:
>
> utf8bytes_to_chars()
>
> chars_to_utf8bytes();
That works for me.
Regards,
Gisle
Jarkko Hietaniemi <[EMAIL PROTECTED]> writes:
> > The is_utf8() also seems wrong to me. I believe that the SV invariant
> > should be that a string marked with the UTF8 flag should not contain
> > illegal UTF8 sequences. Why is it not so?
>
> I'm being paranoid. Keeps me alive.
Being paranoid
Jarkko Hietaniemi <[EMAIL PROTECTED]> writes:
> > I would like to see these convert perl strings to bytes:
> >
> > to_utf8
> >
> > And these convert a sequence of bytes to perl strings:
> >
> > from_utf8
> >
> > You seem to want to define these functions the opposite way. Perhaps
> > the
On Mon, Sep 11, 2000 at 05:09:22PM -0500, Jarkko Hietaniemi wrote:
> > 1. Is there any chance of a null mapping to convert a string containing
> > UTF8 but not marked to one so marked, and vice versa?
>
> "Define 'containing UTF8'. This string contains UTF8."
>
> I purposefully left that one o
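One way to read the "null mapping" request is: take bytes that happen to be well-formed UTF-8 and return the same data marked as characters, refusing anything ill-formed. The sketch below assumes the utf8::decode() and utf8::is_utf8() built-ins of later released Perls (5.8 and up), which are not part of the Encode draft; mark_as_utf8() is a hypothetical helper name.

```perl
use strict;
use warnings;

# mark_as_utf8() is a hypothetical helper: it returns a character string
# when the input bytes are well-formed UTF-8, and undef otherwise.
sub mark_as_utf8 {
    my ($bytes) = @_;
    my $copy = $bytes;                     # leave the caller's string untouched
    return utf8::decode($copy) ? $copy : undef;
}

my $good = mark_as_utf8("\xC3\xA9");   # well-formed: becomes one character, U+00E9
my $bad  = mark_as_utf8("\xC3\x28");   # ill-formed: lead byte without a continuation byte

printf "good: %d char(s), flagged=%d\n",
    length($good), utf8::is_utf8($good) ? 1 : 0;
print "bad:  ", (defined $bad ? "accepted" : "rejected"), "\n";
```

Checking well-formedness before setting the flag keeps the invariant that a flagged string never holds an illegal sequence, which is the concern raised elsewhere in the thread.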
Gisle Aas <[EMAIL PROTECTED]> writes:
>Jarkko Hietaniemi <[EMAIL PROTECTED]> writes:
>
>> Please take a look at the (very rough) first draft of Encode, an extension
>> for character encoding conversions for Perl 5:
>>
>> http://www.iki.fi/jhi/Encode.tgz
>>
>> Download, plop it into the Perl
>Please read Encode.pm. Mainly I'm interested in hearing comments whether
>this is a good interface...
I like the interface. No complicated options.
It might be more useful if the default for the non-utf-8 characters were the
system-defined default character encoding of the process -rather than
On Mon, Sep 11, 2000 at 09:08:05PM +0100, Markus Kuhn wrote:
> Jarkko Hietaniemi wrote on 2000-09-11 19:43 UTC:
> > > I've been working on the other end of it, which is the conversion to and from
> > > other character sets - basically, the plan is to derive the data from the
> >
> > There's also
> I would like to see these convert perl strings to bytes:
>
> to_utf8
>
> And these convert a sequence of bytes to perl strings:
>
> from_utf8
>
> You seem to want to define these functions the opposite way. Perhaps
> the names are just too confusing.
Even on second reading I do not foll
On Tue, Sep 12, 2000 at 12:24:50AM +0200, Gisle Aas wrote:
> Jarkko Hietaniemi <[EMAIL PROTECTED]> writes:
>
> > Please take a look at the (very rough) first draft of Encode, an extension
> > for character encoding conversions for Perl 5:
> >
> > http://www.iki.fi/jhi/Encode.tgz
> >
> > Dow
Jarkko Hietaniemi <[EMAIL PROTECTED]> writes:
> Please take a look at the (very rough) first draft of Encode, an extension
> for character encoding conversions for Perl 5:
>
> http://www.iki.fi/jhi/Encode.tgz
>
> Download, plop it into the Perl 5.7 source directory, unpack,
> re-Configure
> A propos, I think in-place variants of to_utf8() and from_utf8()
> might be in order. utf8_on() and utf8_off(), possibly. Maybe move
> some of the functions to utf8.c, sv_cvtpv_to_utf8(), and
> sv_cvtpv_from_utf8(), newSVpv_utf8(), newSVsv_to_utf8(),
> newSVsv_from_utf8(), maybe.
...and some
On Mon, Sep 11, 2000 at 11:21:42PM +0100, Simon Cozens wrote:
> On Mon, Sep 11, 2000 at 05:17:21PM -0500, Jarkko Hietaniemi wrote:
> > Hmmm, okay. I guess I need to start writing the utf8_on(), then.
>
> Please see
> http://www.xray.mpe.mpg.de/mailing-lists/perl-unicode/2000-09/msg0.html
>
On Mon, Sep 11, 2000 at 05:17:21PM -0500, Jarkko Hietaniemi wrote:
> Hmmm, okay. I guess I need to start writing the utf8_on(), then.
Please see
http://www.xray.mpe.mpg.de/mailing-lists/perl-unicode/2000-09/msg0.html
which contains UTF8::Hack, which does this.
--
Pray to God, but keep row
On Mon, Sep 11, 2000 at 11:13:12PM +0100, Simon Cozens wrote:
> On Mon, Sep 11, 2000 at 05:09:22PM -0500, Jarkko Hietaniemi wrote:
> > > 1. Is there any chance of a null mapping to convert a string containing
> > > UTF8 but not marked to one so marked, and vice versa?
> >
> > "Define 'containing
On Mon, Sep 11, 2000 at 05:09:22PM -0500, Jarkko Hietaniemi wrote:
> > 1. Is there any chance of a null mapping to convert a string containing
> > UTF8 but not marked to one so marked, and vice versa?
>
> "Define 'containing UTF8'. This string contains UTF8."
A string for which could_be_utf8 r
> Later, when we've got IO disciplines, we should be able to mark
> an input handle to in utf8.
s/ to / to be /;
> A propos, I think in-place variants of to_utf8(0 and from_utf8()
s/0/)/;
--
$jhi++; # http://www.iki.fi/jhi/
# There is this special biologist word we use for 'stable'.
On Mon, Sep 11, 2000 at 04:52:33PM -0500, [EMAIL PROTECTED] wrote:
> >There isn't anything to test this with (I did say 'very rough'). Please read
> >Encode.pm. Mainly I'm interested in hearing comments whether this is a good
> >interface, something that could be used to replace Unicode::Map
>There isn't anything to test this with (I did say 'very rough'). Please read
>Encode.pm. Mainly I'm interested in hearing comments whether this is a good
>interface, something that could be used to replace Unicode::Map8 (lots of
>table-driven conversions, for 8-bit legacy character sets), and
On Mon, Sep 11, 2000 at 03:11:51PM -0500, Jarkko Hietaniemi wrote:
> > Note: The Unicode mapping tables on
> >
> > http://www.unicode.org/Public/MAPPINGS/
> >
> > are generally better reviewed and more up to date than these two
> > alternative sources of conversion tables. Their format is also
On Mon, Sep 11, 2000 at 09:08:05PM +0100, Markus Kuhn wrote:
> Jarkko Hietaniemi wrote on 2000-09-11 19:43 UTC:
> > > I've been working on the other end of it, which is the conversion to and from
> > > other character sets - basically, the plan is to derive the data from the
> >
> > There's also
Jarkko Hietaniemi wrote on 2000-09-11 19:43 UTC:
> > I've been working on the other end of it, which is the conversion to and from
> > other character sets - basically, the plan is to derive the data from the
>
> There's also RFC 1345, and
> http://anubis.dkuug.dk/cultreg/registrations/chreg.htm
On Mon, Sep 11, 2000 at 02:13:33PM -0500, Jarkko Hietaniemi wrote:
> There isn't anything to test this with (I did say 'very rough').
> Please read Encode.pm. Mainly I'm interested in hearing comments whether
> this is a good interface, something that could be used to replace
> Unicode::Map8 (lots o
> I've been working on the other end of it, which is the conversion to and from
> other character sets - basically, the plan is to derive the data from the
There's also RFC 1345, and
http://anubis.dkuug.dk/cultreg/registrations/chreg.htm
> The legacy 8-bit stuff is trivial, and when my copy of C
Please take a look at the (very rough) first draft of Encode, an extension
for character encoding conversions for Perl 5:
http://www.iki.fi/jhi/Encode.tgz
Download, plop it into the Perl 5.7 source directory, unpack,
re-Configure, rebuild. (Or, if you have a Perl 5.7 in your path,
cd to
38 matches