Re: [Problem] with LWP, unicode/multibyte chars and Perl 5.6.1 and later
Paul Kulchenko <[EMAIL PROTECTED]> writes:

> Hi, Gisle!
>
> > Correct. The best fix is to not provide such strings. syswrite()
> > refuses to deal with them too.
>
> That effectively means that I can't send utf8 encoded strings over
> the wire (5.7.x croaks and 5.6.x doesn't work correctly). I need to
> convert them first, right?

Right.

> That's where we started. I can definitely
> do it in my application/module, but I would expect that LWP::Protocol
> will take care of it. Am I wrong?

In my view, yes.

> > There is no obvious way to deal with strings containing chars
> > outside 0..255 as _binary strings_.
>
> Ok, the question is WHO should convert them? App? LWP::UserAgent?
> LWP::Protocol? IO::Socket? No one?

I would say the App.

An alternative could be to have the content() method in HTTP::Message
auto-convert when it finds high chars. The problem with this is that it
does not have enough information to select which encoding to use.
From what I understand, you want it to just use UTF8 (because that
happens to be the current internal representation of wide chars).

Regards,
Gisle
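The point about content() lacking the information to choose an encoding can be illustrated with a minimal sketch. It assumes a Perl that ships the Encode module (5.8 or later), and the sample string and variable names are illustrative only, not taken from LWP. The same character string serializes to different byte sequences depending on the encoding picked:

```perl
use strict;
use warnings;
use Encode ();

# Illustrative character string: one char above 255 (U+263A) plus ASCII.
my $text = "\x{263A}ok";

# Two equally valid serializations of the same characters.
my $utf8_bytes  = Encode::encode('UTF-8',    $text);
my $utf16_bytes = Encode::encode('UTF-16BE', $text);

# The byte counts (and the bytes themselves) differ between encodings,
# so a library cannot know which one the application intended.
printf "UTF-8:    %d bytes\n", length $utf8_bytes;
printf "UTF-16BE: %d bytes\n", length $utf16_bytes;
```

Only the application knows which of these the receiving side expects, which is why the encoding step is left to it here.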
Re: [Problem] with LWP, unicode/multibyte chars and Perl 5.6.1 and later
Paul Kulchenko <[EMAIL PROTECTED]> writes:

> --- Gisle Aas <[EMAIL PROTECTED]> wrote:
> > What you want is LWP to deal with strings with the _internal_ UTF8
> > flag set. In my view LWP can't guess what encoding to apply to
>
> Quite opposite. I want that LWP deals with string as binary strings
> regardless of used encoding.

There is no obvious way to deal with strings containing chars outside
0..255 as _binary strings_.

> length() used there deals with string as
> set of chars instead of bytes, thus making impossible for LWP to
> specify proper content-length and call sysread/syswrite with proper
> size.

Correct. The best fix is to not provide such strings. syswrite()
refuses to deal with them too.

> > serialize that kind of string. UTF-8 is not really a more obvious
> > choice than UTF-16. If it happens to be an image that somehow got
> > the UTF8 flag set then any UTF-encoding would be wrong, as the
> > string should simply be utf8_downgraded to be ok again.
>
> Absolutely. That's exactly what should be done imho. String should be
> downgraded to set of bytes.

But only when all chars are 0..255.

> > > What do you expect me to do?
> >
> > If you want the string UTF8 encoded, then say so explicitly:
> >
> >     $req = HTTP::Request->new(POST => $endpoint);
> >     $req->content_type("text/plain; charset='utf8'");
> >     $req->content(Encode::encode_utf8($utf8));
>
> But that's precisely what I do. Content is being specified
> incorrectly because length() calculates chars on utf8-encoded
> strings. That's where we started.

You did not have the Encode::encode_utf8() call. If you add it,
everything should be fine, and all chars in the content will be in
range 0..255.

Regards,
Gisle
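The difference between the character string and its encoded form can be checked with a small sketch. It assumes a Perl with the Encode module (5.8 or later), and the sample string is illustrative. length() counts characters before encoding and bytes after it, so a Content-Length computed from the encoded string matches what goes on the wire:

```perl
use strict;
use warnings;
use Encode ();

# Illustrative character string containing one char above 255.
my $chars = "\x{263A}abc";

# Encode explicitly before handing the content to HTTP::Request.
my $bytes = Encode::encode_utf8($chars);

# length() counts characters in $chars but bytes in $bytes.
printf "chars: %d, bytes: %d\n", length($chars), length($bytes);

# After encoding, every element of the string is in range 0..255.
my $max = 0;
for my $c (split //, $bytes) {
    $max = ord($c) if ord($c) > $max;
}
printf "largest ord() in encoded content: %d\n", $max;
```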
Re: [Problem] with LWP, unicode/multibyte chars and Perl 5.6.1 and later
Nick Ing-Simmons <[EMAIL PROTECTED]> writes:

> Gisle Aas <[EMAIL PROTECTED]> writes:
> >Paul Kulchenko <[EMAIL PROTECTED]> writes:
> >
> >> [Problem]
> >> 1. LWP::Protocol::http and others use length() to calculate
> >> content-length and in Perl 5.6.1 and later length() calculates chars
> >> instead of bytes. It means that every request that has multibyte
> >> chars in it will have wrong content-length and other side will read
> >> less bytes than required.
> >
> >In my view it is a bug to put content containing chars with ord() > 255
> >in the content of a HTTP::Request. If you want UTF8 encoded stuff
> >you should put UTF8 encoded stuff in the content. Don't expect perl
> >to magically guess. You should use Encode::encode_utf8($str) or
> >something like it.
> >
> >If there was an easy way I would like to add a
> >
> >    sv_utf8_downgrade($req->content, 0);
>
>     utf8::downgrade($req->content, 0);
>
> for perl 5.7.* for large-enough *

I guess I could do something like

    utf8::downgrade($req->content, 0) if defined &utf8::downgrade;

then. Might want to add this to the HTTP::Message->content() method so
it croaks as soon as you try to put wide characters in.

> >to the LWP::Protocol code. This would make requests with such chars
> >in them fail early. I think the write call on the socket ought to do
> >the downgrade and croaking for me though.
>
> Again I think 5.7.* branch should do that - it was certainly the intent
> (it may only warn)

I just verified that syswrite does indeed croak in bleedperl. This
program:

    require LWP::UserAgent;
    my $ua = LWP::UserAgent->new;

    my $req = HTTP::Request->new(POST => 'http://localhost/test.cgi');
    $req->content_type("text/plain");
    $req->content(v200.300.400);

    my $res = $ua->request($req);
    print $res->as_string;

prints:

    500 (Internal Server Error) Wide character in syswrite
    Client-Date: Thu, 06 Sep 2001 20:11:41 GMT
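The fail-early check discussed above might look roughly like the following sketch. It assumes a Perl where the built-in utf8::downgrade(), utf8::upgrade() and utf8::is_utf8() functions exist (5.8 or later). The helper name content_as_bytes is hypothetical and not part of LWP or HTTP::Message. Because utf8::downgrade() modifies its argument in place, the sketch works on a copy rather than on the return value of a method call:

```perl
use strict;
use warnings;
use Carp ();

# Hypothetical helper, not part of LWP: return a byte-string copy of the
# content, or croak if it holds characters above 255.
sub content_as_bytes {
    my ($content) = @_;               # copy, since downgrade works in place
    utf8::downgrade($content, 1)      # true 2nd arg: return false, don't die
        or Carp::croak("Wide character in content");
    return $content;
}

# A string with the internal UTF8 flag set but only chars 0..255 is fine.
my $flagged = "caf\x{e9}";
utf8::upgrade($flagged);
my $ok = content_as_bytes($flagged);

# A string with chars above 255 is rejected before it reaches the socket.
my $bad = eval { content_as_bytes("\x{263A}"); 1 };

print "flagged latin1 downgraded: ", (utf8::is_utf8($ok) ? "no" : "yes"), "\n";
print "wide string rejected: ",      ($bad ? "no" : "yes"), "\n";
```

This only removes the internal flag when every character fits in a byte; it does not choose an encoding for wider characters, which stays with the application.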
Re: [Problem] with LWP, unicode/multibyte chars and Perl 5.6.1 and later
Paul Kulchenko <[EMAIL PROTECTED]> writes:

> Hi, Gisle!

Hi, Paul!

> --- Gisle Aas <[EMAIL PROTECTED]> wrote:
> > In my view it is a bug to put content containing chars with ord() > 255
> > in the content of a HTTP::Request. If you want UTF8 encoded stuff
> > you should put UTF8 encoded stuff in the content. Don't expect perl
> > to magically guess. You should use Encode::encode_utf8($str) or
> > something like it.
>
> I'm not sure I follow you. I do have my string utf8 encoded using
> Perl capabilities. What do you mean "is a bug to put content
> containing chars with ord() > 255 in the content of a
> HTTP::Request"? What should I do then if I have utf8 encoded string
> to send? I expect LWP will handle it properly on wire.

If you have a utf8 encoded string then none of the chars in the string
will have ord() > 255. It is the kind of string that
Encode::encode_utf8() would produce.

What you want is LWP to deal with strings with the _internal_ UTF8 flag
set. In my view LWP can't guess what encoding to apply to serialize
that kind of string. UTF-8 is not really a more obvious choice than
UTF-16. If it happens to be an image that somehow got the UTF8 flag set
then any UTF-encoding would be wrong, as the string should simply be
utf8_downgraded to be ok again.

> > If there was an easy way I would like to add a
> >     sv_utf8_downgrade($req->content, 0);
> > to the LWP::Protocol code. This would make requests with such
> > chars in them fail early.
>
> I don't understand why they should be failed. What's wrong with this:
>
>   $utf8 = pack('U*', unpack('C*', $something_russian_latin1_encoded));
>   $req = HTTP::Request
>     ->new(POST => $endpoint, HTTP::Headers->new, $utf8);
>   $resp = LWP::UserAgent->new->request($req);
>
> request won't be properly encoded in 5.6.1 and later unless you drop
> utf8 mark from $utf8. I do need to have utf8 encoding on wire.
>
> What do you expect me to do?

If you want the string UTF8 encoded, then say so explicitly:

    $req = HTTP::Request->new(POST => $endpoint);
    $req->content_type("text/plain; charset='utf8'");
    $req->content(Encode::encode_utf8($utf8));

> > > # drop UTF mark
> > > $str = pack('C0A*', $str) if length($str) != bytelength($str);
> > >
> > > Ideally I would like to have it fixed in LWP::Protocol. btw, how
> > > quick is pack 'C0A*'?
> >
> > It will certainly have to copy the string. I think it would be
> > better to try to use one of functions the Encode module provides.
>
> I need it to work with all Perls starting 5.005. Encode wasn't
> available in 5.6.x, was it?

That is true. How about something like this (untested):

    eval {
        require Encode;
    };
    if ($@) {
        # replacement
        *Encode::encode_utf8 = sub { pack('C0A*', shift) };
    }

Actually, I think pack is buggy if this works. The A* really ought to
downgrade the string it packs and croak if this is not possible.

Regards,
Gisle
Re: [Problem] with LWP, unicode/multibyte chars and Perl 5.6.1 and later
Paul Kulchenko <[EMAIL PROTECTED]> writes:

> [Problem]
> 1. LWP::Protocol::http and others use length() to calculate
> content-length and in Perl 5.6.1 and later length() calculates chars
> instead of bytes. It means that every request that has multibyte
> chars in it will have wrong content-length and other side will read
> less bytes than required.

In my view it is a bug to put content containing chars with ord() > 255
in the content of a HTTP::Request. If you want UTF8 encoded stuff you
should put UTF8 encoded stuff in the content. Don't expect perl to
magically guess. You should use Encode::encode_utf8($str) or something
like it.

If there was an easy way I would like to add a

    sv_utf8_downgrade($req->content, 0);

to the LWP::Protocol code. This would make requests with such chars in
them fail early. I think the write call on the socket ought to do the
downgrade and croaking for me though.

> 2. LWP::Protocol::http::request overwrites Content-length header even
> if application properly specifies it. It's easy to fix (below), but
> it doesn't help much, because length() is used in syswrite/sysread
> calls to calculate size. See problem 1.
>
> [Solution]
> Problem 2 is easy to fix:
>
>   $h->header('Content-Length' => length $$cont_ref)
>     if defined($$cont_ref) && length($$cont_ref);
>
> should be
>
>   $h->header('Content-Length' => length $$cont_ref)
>     if !defined($h->header('Content-Length')) &&
>        defined($$cont_ref) && length($$cont_ref);

I think overriding Content-Length is the right thing to do.

> Problem 1 is more complex. 'use bytes' is lexically scoped, hence
> doesn't help outside of LWP::Protocol::http. eval "use bytes"; can be
> used to fix it inside LWP::Protocol::http and others.
>
> On application level the only solution I can come up with is this:
>
> BEGIN {
>   sub bytelength;
>   eval ( eval('use bytes; 1') # 5.6.0 and later?
>     ? 'sub bytelength { use bytes; length(@_ ? $_[0] : $_) }; 1'
>     : 'sub bytelength { length(@_ ? $_[0] : $_) }; 1'
>   ) or die;
> }
>
> # drop UTF mark
> $str = pack('C0A*', $str) if length($str) != bytelength($str);
>
> Ideally I would like to have it fixed in LWP::Protocol. btw, how
> quick is pack 'C0A*'?

It will certainly have to copy the string. I think it would be better
to try to use one of the functions the Encode module provides.

Regards,
Gisle