Re: [Problem] with LWP, unicode/multibyte chars and Perl 5.6.1 and later

2001-09-06 Thread Gisle Aas

Paul Kulchenko <[EMAIL PROTECTED]> writes:

> Hi, Gisle!
> 
> > Correct.  The best fix is to not provide such strings.  syswrite()
> > refuses to deal with them too.
>
> That effectively means that I can't send utf8 encoded strings over
> the wire (5.7.x croaks and 5.6.x doesn't work correctly). I need to
> convert them first, right?

Right.

> That's where we started. I can definitely
> do it in my application/module, but I would expect that LWP::Protocol
> will take care of it. Am I wrong?

In my view, yes.

> > There is no obvious way to deal with strings containing chars
> > outside 0..255 as _binary strings_.
>
> Ok, the question is WHO should convert them? App? LWP::UserAgent?
> LWP::Protocol? IO::Socket? No one?

I would say the App.

An alternative could be to have the content() method in HTTP::Message
auto convert when it finds high chars.  The problem with this is that
it does not have enough information for selecting what encoding to
use.  From what I understand, you want it to just use UTF8 (because
that happens to be the current internal representation of wide chars).

Regards,
Gisle
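
As an illustration of why content() has too little information to choose an encoding, here is a minimal sketch (not from the thread) that serializes the same wide-character string in two ways. It assumes a perl where the Encode module is available (5.8 or later); the variable names are made up for the example.

```perl
use strict;
use warnings;
use Encode qw(encode);

# A character string holding one char above 255
# (U+0416, CYRILLIC CAPITAL LETTER ZHE).
my $text = "\x{416}";

# Two equally valid serializations of the same character string.
my $as_utf8  = encode('UTF-8',    $text);
my $as_utf16 = encode('UTF-16BE', $text);

# Print each octet string as hex bytes.
printf "UTF-8:    %s\n", join ' ', map { sprintf '%02X', ord } split //, $as_utf8;
printf "UTF-16BE: %s\n", join ' ', map { sprintf '%02X', ord } split //, $as_utf16;
# UTF-8:    D0 96
# UTF-16BE: 04 16
```

Both results are plain octet strings, but the bytes differ, so only the caller can say which one belongs on the wire.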



Re: [Problem] with LWP, unicode/multibyte chars and Perl 5.6.1 and later

2001-09-06 Thread Gisle Aas

Paul Kulchenko <[EMAIL PROTECTED]> writes:

> --- Gisle Aas <[EMAIL PROTECTED]> wrote:
> > What you want is LWP to deal with strings with the _internal_ UTF8
> > flag set.  In my view LWP can't guess what encoding to apply to
> Quite opposite. I want that LWP deals with string as binary strings
> regardless of used encoding.

There is no obvious way to deal with strings containing chars outside
0..255 as _binary strings_.

> length() used there deals with string as
> set of chars instead of bytes, thus making impossible for LWP to
> specify proper content-length and call sysread/syswrite with proper
> size.

Correct.  The best fix is to not provide such strings.  syswrite()
refuses to deal with them too.

> > serialize that kind of string.  UTF-8 is not really a more obvious
> > choice than UTF-16.  If it happens to be an image that somehow got
> > the
> > UTF8 flag set then any UTF-encoding would be wrong, as the string
> > should simply be utf8_downgraded to be ok again.
> Absolutely. That's exactly what should be done imho. String should be
> downgraded to set of bytes.

But only when all chars are 0..255.

> > > What do you expect me to do?
> > If you want the string UTF8 encoded, then say so explicitly:
> >  $req = HTTP::Request->new(POST => $endpoint);
> >  $req->content_type("text/plain; charset=UTF-8");
> >  $req->content(Encode::encode_utf8($utf8));
> But that's precisely what I do. Content is being specified
> incorrectly because length() calculates chars on utf8-encoded
> strings. That's where we started.

You did not have the Encode::encode_utf8() call.  If you do, everything
should be fine, and all chars in the content will be in range 0..255.

Regards,
Gisle
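
The difference between counting characters and counting octets can be seen in a small sketch like the following. It is not code from the thread; it assumes the Encode module is installed, and the sample string is arbitrary.

```perl
use strict;
use warnings;
use Encode ();

# Character string: three chars, two of them above 255.
my $chars = "a\x{416}\x{20AC}";

# Octet string: the UTF-8 serialization of the same text.
my $octets = Encode::encode_utf8($chars);

print length($chars),  "\n";   # 3 (characters)
print length($octets), "\n";   # 6 (octets: 1 + 2 + 3)

# After encoding, every element of the string fits in one byte.
my @wide = grep { ord($_) > 255 } split //, $octets;
print scalar(@wide), "\n";     # 0
```

With the encoded string as content, length() and the number of bytes written to the socket agree, so a Content-Length computed with length() is correct.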



Re: [Problem] with LWP, unicode/multibyte chars and Perl 5.6.1 and later

2001-09-06 Thread Gisle Aas

Nick Ing-Simmons <[EMAIL PROTECTED]> writes:

> Gisle Aas <[EMAIL PROTECTED]> writes:
> >Paul Kulchenko <[EMAIL PROTECTED]> writes:
> >
> >> [Problem]
> >> 1. LWP::Protocol::http and others use length() to calculate
> >> content-length and in Perl 5.6.1 and later length() calculates chars
> >> instead of bytes. It means that every request that has multibyte
> >> chars in it will have wrong content-length and other side will read
> >> less bytes than required. 
> >
> >In my view it is a bug to put content containing chars with ord() > 255
> >in the content of a HTTP::Request.  If you want UTF8 encoded stuff
> >you should put UTF8 encoded stuff in the content.  Don't expect perl
> >to magically guess.  You should use Encode::encode_utf8($str) or
> >something like it.
> >
> >If there was an easy way I would like to add a
> >
> >  sv_utf8_downgrade($req->content, 0);
> 
>   utf8::downgrade($req->content, 0);
> 
> for perl 5.7.* for large-enough *

I guess I could do something like

utf8::downgrade($req->content, 0) if defined &utf8::downgrade;

then.  Might want to add this to the HTTP::Message->content() method
so it croaks as soon as you try to put wide characters in.

> >
> >to the LWP::Protocol code.  This would make requests with such chars
> >in them fail early.  I think the write call on the socket ought to do
> >the downgrade and croaking for me though.
> 
> Again I think 5.7.* branch should do that - it was certainly the intent
> (it may only warn)

I just verified that syswrite does indeed croak in bleedperl.

This program:


require LWP::UserAgent;
my $ua = LWP::UserAgent->new;

my $req = HTTP::Request->new(POST => 'http://localhost/test.cgi');
$req->content_type("text/plain");
$req->content(v200.300.400);

my $res = $ua->request($req);
print $res->as_string;


prints:

500 (Internal Server Error) Wide character in syswrite
Client-Date: Thu, 06 Sep 2001 20:11:41 GMT
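
The fail-early check discussed in this message can be sketched as below. This is an illustration under assumptions, not LWP code: is_octets() is a hypothetical helper, and the sketch relies on utf8::downgrade() as found in perl 5.8 and later, where a true second argument makes it return false instead of croaking.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the string can be represented as plain octets.
sub is_octets {
    my $copy = shift;    # work on a copy, leave the caller's string alone
    return utf8::downgrade($copy, 1) ? 1 : 0;
}

my $latin1 = "caf\xE9";
utf8::upgrade($latin1);  # internal UTF8 flag is set, all chars still <= 255

my $wide = "\x{416}";    # one char above 255

print is_octets($latin1), "\n";   # 1
print is_octets($wide),   "\n";   # 0

# With a false second argument, as in the thread, failure is fatal.
my $copy = $wide;
eval { utf8::downgrade($copy, 0); 1 } or print "croaked: $@";
```

A string that merely has the flag set downgrades cleanly, while a string with genuinely wide characters is rejected before anything reaches the socket.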



Re: [Problem] with LWP, unicode/multibyte chars and Perl 5.6.1 and later

2001-09-04 Thread Gisle Aas

Paul Kulchenko <[EMAIL PROTECTED]> writes:

> Hi, Gisle!

Hi, Paul!

> --- Gisle Aas <[EMAIL PROTECTED]> wrote:
> > In my view it is a bug to put content containing chars with ord() > 255
> > in the content of a HTTP::Request.  If you want UTF8 encoded stuff
> > you should put UTF8 encoded stuff in the content.  Don't expect perl
> > to magically guess.  You should use Encode::encode_utf8($str) or
> > something like it.
>
> I'm not sure I follow you. I do have my string utf8 encoded using
> Perl capabilities. What do you mean "is a bug to put content
> containing chars with ord() > 255 in the content of a
> HTTP::Request"? What should I do then if I have utf8 encoded string
> to send? I expect LWP will handle it properly on wire.

If you have a utf8 encoded string then none of the chars in the string
will have ord() > 255.  It is the kind of string that
Encode::encode_utf8() would produce.

What you want is LWP to deal with strings with the _internal_ UTF8
flag set.  In my view LWP can't guess what encoding to apply to
serialize that kind of string.  UTF-8 is not really a more obvious
choice than UTF-16.  If it happens to be an image that somehow got the
UTF8 flag set then any UTF-encoding would be wrong, as the string
should simply be utf8_downgraded to be ok again.

> > If there was an easy way I would like to add a
> >   sv_utf8_downgrade($req->content, 0);
> > to the LWP::Protocol code.  This would make requests with such
> > chars in them fail early.  
> I don't understand why they should be failed. What's wrong with this:
> 
> $utf8 = pack('U*', unpack('C*', $something_russian_latin1_encoded));
> $req = HTTP::Request
>   ->new(POST => $endpoint, HTTP::Headers->new, $utf8);
> $resp = LWP::UserAgent->new->request($req);
> 
> request won't be properly encoded in 5.6.1 and later unless you drop
> utf8 mark from $utf8. I do need to have utf8 encoding on wire. 
> 
> What do you expect me to do?

If you want the string UTF8 encoded, then say so explicitly:

 $req = HTTP::Request->new(POST => $endpoint);
 $req->content_type("text/plain; charset=UTF-8");
 $req->content(Encode::encode_utf8($utf8));

> > > # drop UTF mark
> > > $str = pack('C0A*', $str) if length($str) != bytelength($str);
> > > 
> > > Ideally I would like to have it fixed in LWP::Protocol. btw, how
> > > quick is pack 'C0A*'?
> > 
> > It will certainly have to copy the string.  I think it would be
> > better to try to use one of the functions the Encode module provides.
>
> I need it to work with all Perls starting 5.005. Encode wasn't
> available in 5.6.x, was it?

That is true.  How about something like this (untested):

   eval { require Encode; };
   if ($@) {
       # replacement
       *Encode::encode_utf8 = sub { pack('C0A*', shift) };
   }

Actually, I think pack is buggy if this works.  The A* really ought to
downgrade the string it packs and croak if this is not possible.
   
Regards,
Gisle
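
The distinction drawn in this message, between a string that is UTF-8 encoded and a string that carries the internal UTF8 flag, can be inspected with a sketch like this one. It is an illustration only and assumes perl 5.8 or later, where utf8::is_utf8() and the Encode module are available.

```perl
use strict;
use warnings;
use Encode ();

# A character string with a wide char: perl stores it with the UTF8 flag on.
my $chars = "\x{416}";

# The explicitly encoded form: an octet string with the flag off.
my $octets = Encode::encode_utf8($chars);

for my $pair ([ chars => $chars ], [ octets => $octets ]) {
    my ($name, $str) = @$pair;
    printf "%-6s flag=%s length=%d max ord=%d\n",
        $name,
        (utf8::is_utf8($str) ? 'on' : 'off'),
        length($str),
        (sort { $b <=> $a } map { ord } split //, $str)[0];
}
# chars  flag=on length=1 max ord=1046
# octets flag=off length=2 max ord=150
```

Only the second string is what the thread calls "utf8 encoded": all of its elements are in the range 0..255.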



Re: [Problem] with LWP, unicode/multibyte chars and Perl 5.6.1 and later

2001-09-04 Thread Gisle Aas

Paul Kulchenko <[EMAIL PROTECTED]> writes:

> [Problem]
> 1. LWP::Protocol::http and others use length() to calculate
> content-length and in Perl 5.6.1 and later length() calculates chars
> instead of bytes. It means that every request that has multibyte
> chars in it will have wrong content-length and other side will read
> less bytes than required. 

In my view it is a bug to put content containing chars with ord() > 255
in the content of a HTTP::Request.  If you want UTF8 encoded stuff
you should put UTF8 encoded stuff in the content.  Don't expect perl
to magically guess.  You should use Encode::encode_utf8($str) or
something like it.

If there was an easy way I would like to add a

  sv_utf8_downgrade($req->content, 0);

to the LWP::Protocol code.  This would make requests with such chars
in them fail early.  I think the write call on the socket ought to do
the downgrade and croaking for me though.

> 2. LWP::Protocol::http::request overwrites Content-length header even
> if application properly specifies it. It's easy to fix (below), but
> it doesn't help much, because length() is used in syswrite/sysread
> calls to calculate size. See problem 1.
> 
> [Solution]
> Problem 2 is easy to fix:
> 
>   $h->header('Content-Length' => length $$cont_ref)
>   if defined($$cont_ref) && length($$cont_ref);
> 
> should be
> 
>   $h->header('Content-Length' => length $$cont_ref)
>   if !defined($h->header('Content-Length')) &&
>      defined($$cont_ref) && length($$cont_ref);

I think overriding Content-Length is the right thing to do.

> Problem 1 is more complex. 'use bytes' is lexically scoped, hence
> doesn't help outside of LWP::Protocol::http. eval "use bytes"; can be
> used to fix it inside LWP::Protocol::http and others.
> 
> On application level the only solution I can come up with is this:
> 
> BEGIN { 
>   sub bytelength; 
>   eval ( eval('use bytes; 1') # 5.6.0 and later?
>     ? 'sub bytelength { use bytes; length(@_ ? $_[0] : $_) }; 1'
>     : 'sub bytelength { length(@_ ? $_[0] : $_) }; 1'
>   ) or die;
> }
> 
> # drop UTF mark
> $str = pack('C0A*', $str) if length($str) != bytelength($str);
> 
> Ideally I would like to have it fixed in LWP::Protocol. btw, how
> quick is pack 'C0A*'?

It will certainly have to copy the string.  I think it would be better
to try to use one of the functions the Encode module provides.

Regards,
Gisle
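
For reference, the length() versus byte-length mismatch described in this message can be reproduced with a short sketch. It is not from the thread; it assumes a perl that ships the bytes module (5.6 or later) and utf8::upgrade() (5.8 or later), and the sample string is arbitrary.

```perl
use strict;
use warnings;
use bytes ();   # load the module without enabling the pragma in this scope

# Four characters, all <= 255, forced into the internal UTF-8 representation.
my $str = "caf\xE9";
utf8::upgrade($str);

print length($str),        "\n";   # 4 (characters)
print bytes::length($str), "\n";   # 5 (bytes of the internal representation)
```

bytes::length() reports the size of perl's internal buffer, which is why it can differ from length() even when no character is above 255.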