RE: RFC: WWW::Mechanize::Compress or LWP patch?
Gisle,

> > The more I'm thinking about it, it's not appropriate for LWP. As far
> > as I can think of, this would be the first and only instance of LWP
> > modifying content that it receives before passing it back to the
> > caller. I'm not sure that's a direction we should go.
>
> I agree. There are lots of questions to answer about what exactly
> should happen in this case with regard to the various Content-*
> headers. Content-Length is obvious, but there are also headers like
> Content-MD5. How about partial content with Content-Range?

Like I said in my reply to David's message, I would hope that we
wouldn't have to modify the original response at all.

> > Also, there needs to be something selectable for the users who happen
> > to have Compress::Zlib but don't want to get compressed data for
> > whatever reason.
>
> It would certainly not happen by default. If you download a .tar.gz
> file you don't want to end up with a .tar file just because you
> happened to have Compress::Zlib installed.

Okay, but wouldn't the tar.gz file be a tar.gz.gz file if it were
content-encoded (gzip)? Thus uncompressing it would give you back the
tar.gz file. It was always my understanding that if Content-Encoding
has a value, then doing the reverse transform gives you back the
original file, regardless of any original compression.

-Brian

PS: Whoops, I originally only sent this to Gisle.

http://www.gordano.com - Messaging for educators.
Re: RFC: WWW::Mechanize::Compress or LWP patch?
---BeginMessage---
Brian Cassidy [EMAIL PROTECTED] writes:

> > > Also, there needs to be something selectable for the users who
> > > happen to have Compress::Zlib but don't want to get compressed
> > > data for whatever reason.
> >
> > It would certainly not happen by default. If you download a .tar.gz
> > file you don't want to end up with a .tar file just because you
> > happened to have Compress::Zlib installed.
>
> Okay, but wouldn't the tar.gz file be a tar.gz.gz file if it were
> content-encoded (gzip)? Thus uncompressing it would give you back the
> tar.gz file.

No. A server does the right thing if it marks a gzipped tar file with
"Content-Encoding: gzip". Some examples:

  [EMAIL PROTECTED] gisle]$ HEAD http://cpan.org/src/latest.tar.gz
  200 OK
  Connection: close
  Date: Tue, 02 Dec 2003 20:00:08 GMT
  Accept-Ranges: bytes
  ETag: 94cf4-b585df-adad8740
  Server: Apache/2.0.47 (Unix) DAV/2
  Content-Encoding: x-gzip
  Content-Length: 11896287
  Content-Type: application/x-tar
  Last-Modified: Wed, 05 Nov 2003 23:36:21 GMT
  Client-Date: Tue, 02 Dec 2003 20:00:08 GMT
  Client-Peer: 63.251.223.172:80
  Client-Response-Num: 1

  [EMAIL PROTECTED] gisle]$ HEAD ftp://ftp.cpan.org/pub/CPAN/src/latest.tar.gz
  200 OK
  Server: (vsFTPd 1.1.3)
  Content-Encoding: gzip
  Content-Length: 11896287
  Content-Type: application/x-tar
  Last-Modified: Wed, 05 Nov 2003 23:36:21 GMT
  Client-Date: Tue, 02 Dec 2003 20:00:54 GMT
  Client-Request-Num: 1

> It was always my understanding that if content-encoding has a value,
> then doing the reverse will give you back the original file,
> regardless of any original compression.

In HTTP there is not really any concept of an original file. There is
just an entity (aka content) that is described with various Content-*
headers. A Content-Encoding header just says that there is some
transform needed on the content before you obtain the media type
denoted by the Content-Type header.

Regards,
Gisle
---End Message---
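Gisle's point that Content-Encoding is just a reversible transform on
the entity can be sketched in a few lines of Perl. This is a minimal
illustration, not code from the thread; it uses the core IO::Compress
and IO::Uncompress modules rather than Compress::Zlib (that module
choice is an assumption for the sketch):

```perl
use strict;
use warnings;
use IO::Compress::Gzip qw(gzip $GzipError);
use IO::Uncompress::Gunzip qw(gunzip $GunzipError);

# The entity is a tar file; the server sends it gzip-encoded, so the
# headers would read "Content-Type: application/x-tar" plus
# "Content-Encoding: gzip".
my $entity = "pretend these are tar bytes";
gzip( \$entity, \my $wire_body ) or die "gzip failed: $GzipError";

# Undoing the Content-Encoding yields the media type named by
# Content-Type (application/x-tar), not some "original file".
gunzip( \$wire_body, \my $decoded ) or die "gunzip failed: $GunzipError";

print $decoded eq $entity ? "transform undone\n" : "mismatch\n";
```

Note that nothing here depends on whether the entity itself was already
compressed: the encoding is applied and removed as a separate layer.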
RE: RFC: WWW::Mechanize::Compress or LWP patch?
Gisle,

Doh. Way to ruin my day! :)

So, are there any proposed work-arounds, or is this completely useless
now?

-Brian

-----Original Message-----
> No. A server does the right thing if it marks a gzipped tar file with
> "Content-Encoding: gzip". Some examples:
Re: RFC: WWW::Mechanize::Compress or LWP patch?
David Carter [EMAIL PROTECTED] writes:

> My understanding had always been that content-encoding (when talking
> about compression) is in practical terms no different than
> transfer-encoding. LWP already handles transfer-encoding (gzip or
> deflate), so what's the big deal about it also handling
> content-encoding compression in a transparent manner?

Transfer-Encoding and Content-Encoding work at different levels in the
HTTP protocol. It makes perfect sense to handle Transfer-Encoding
transparently in a client library. It does not make sense to try to
hide Content-Encoding in the same way.

> My suggestion would be to make it the default to handle it
> transparently, but provide an option to turn it off if someone needs
> access to the raw datastream. All GUI browsers just do it - the user
> doesn't have to be concerned with either content-encoding or
> transfer-encoding.

I disagree. LWP is not a GUI browser and should not hide
content-encoding by default.

> If you have a file in .tar.gz format, the web server should NOT
> return a "Content-Encoding: gzip" header.

Sure it should. Especially if the Content-Type header describes the
type of document you end up with after you 'gunzip' it.

> This would incur redundant processing costs on the server and the
> client, attempting to re-compress an already compressed file for
> little or no gain. Instead, the server would send an appropriate mime
> type indicating to the client that this is a compressed archive file
> (usually handled in a GUI client by presenting a file download dialog
> box).

I disagree here, but I'm sure practice differs among servers. Apache
seems to serve .tar.gz files as:

  Content-Type: application/x-tar
  Content-Encoding: x-gzip

and I think that is exactly as it should be.

> It may not be what the RFCs originally intended, but modern web
> server implementations of on-the-fly compression in my experience
> always use content-encoding rather than transfer-encoding.

Could it have something to do with what MSIE implements?
> I've written a server-side plug-in to do this on the Netscape/iPlanet
> web server, and have done fairly extensive research on what's out
> there in Apache, etc.

I'm not opposed to adding stuff to LWP that lets you undo
Content-Encoding, but it needs to be enabled explicitly to make it
backwards compatible.

LWP currently has code that tries to parse the <head> section of
text/html documents to extract headers from the <meta> and <base>
elements. This code fails when the document is compressed, so there is
actually a need for undo-content-encoding support in the LWP core.

I think most users would be served well with an option that simply
tells LWP to try to undo content-encoding for any text/* content, but
I'm also thinking that LWP should have some kind of generic filtering
mechanism similar to Perl's IO layers. That should be able to deal
with content-encoding and might even turn the content into Unicode
strings and similar based on the charset parameter.

Regards,
Gisle
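The "undo content-encoding for any text/* content" option Gisle
describes boils down to a small predicate. A hypothetical sketch (the
helper name and the exact media-type test are assumptions, not LWP
API):

```perl
use strict;
use warnings;

# Hypothetical helper: decide whether a response's Content-Encoding
# should be undone automatically. Only text/* content qualifies, and
# only for encodings we know how to reverse.
sub should_undo_encoding {
    my ( $content_type, $content_encoding ) = @_;
    return 0 unless defined $content_encoding && length $content_encoding;
    return 0 unless defined $content_type && $content_type =~ m{^text/}i;
    # Servers use both "gzip" and the older "x-gzip" token.
    return $content_encoding =~ /\b(?:x-)?gzip\b|\bdeflate\b/i ? 1 : 0;
}

print should_undo_encoding( 'text/html', 'gzip' ), "\n";           # 1
print should_undo_encoding( 'application/x-tar', 'x-gzip' ), "\n"; # 0
```

The second case is the .tar.gz situation from earlier in the thread:
the encoding is left alone because the content is not text/*.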
Re: RFC: WWW::Mechanize::Compress or LWP patch?
Brian Cassidy [EMAIL PROTECTED] writes:

> So, are there any proposed work-arounds, or is this completely
> useless now?

The decompressor just needs to be smarter about when to kick in. For a
GUI browser it would kick in if you decide to display the document.
This can be determined after looking at the Content-Type, but I'm sure
they do all kinds of stuff to try to second-guess the server (like
looking at the URL suffix and peeking at the first block of the
document content).

For WWW::Mechanize you could, for instance, try to automatically undo
content-encoding once code actually tries to match against the
content, parse forms, etc.

Regards,
Gisle
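The decompress-only-when-needed idea can be sketched as a lazy
accessor. This is a hypothetical stand-in class for illustration, not
WWW::Mechanize code, and it uses the core IO::Uncompress::Gunzip module
rather than Compress::Zlib:

```perl
package Lazy::Content;    # hypothetical name, for illustration only

use strict;
use warnings;
use IO::Uncompress::Gunzip qw(gunzip $GunzipError);

sub new {
    my ( $class, %args ) = @_;
    # Keep the raw (possibly compressed) body untouched; the original
    # response is never modified.
    return bless { raw => $args{raw}, encoding => $args{encoding} }, $class;
}

# Decode only on first access: the decompressor "kicks in" when code
# actually tries to match against the content or parse forms.
sub content {
    my $self = shift;
    return $self->{decoded} if exists $self->{decoded};
    if ( ( $self->{encoding} || '' ) =~ /gzip/i ) {
        gunzip( \$self->{raw}, \my $plain )
            or die "gunzip failed: $GunzipError";
        $self->{decoded} = $plain;
    }
    else {
        $self->{decoded} = $self->{raw};
    }
    return $self->{decoded};
}

1;
```

Code that only inspects headers or saves the raw body to disk never
pays the decompression cost.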
Re: RFC: WWW::Mechanize::Compress or LWP patch?
From the redirects thread also on this list, I find the browser test
suite site Brian mentioned to be relevant to this discussion.
Shouldn't LWP be able to handle this URL automagically? (Whether by
default or not is, as always, open to discussion.)

http://diveintomark.org/tests/client/http/200_gzip.xml

IE just does it.

Regards,
David

On 03 Dec 2003 05:45:19 -0800, Gisle Aas wrote:

> Brian Cassidy [EMAIL PROTECTED] writes:
>
> > So, are there any proposed work-arounds, or is this completely
> > useless now?
>
> The decompressor just needs to be smarter about when to kick in. For
> a GUI browser it would kick in if you decide to display the document.
> This can be determined after looking at the content-type, but I'm
> sure they do all kinds of stuff to try to second-guess the server
> (like looking at the URL suffix and peeking at the first block of the
> document content).
>
> For WWW::Mechanize you could for instance try to automatically undo
> content-encoding once code actually tries to match against content,
> parse forms, etc.
>
> Regards,
> Gisle
Re: RFC: WWW::Mechanize::Compress or LWP patch?
[EMAIL PROTECTED] writes:

> Shouldn't LWP be able to handle this URL automagically?

Yes, if we can find an API everybody is happy with.

> http://diveintomark.org/tests/client/http/200_gzip.xml
>
> IE just does it.

So does Mozilla. Konqueror suggests saving or opening the file in an
external app, but the file saved or given to an external app is still
gzipped.

Regards,
Gisle
Re: RFC: WWW::Mechanize::Compress or LWP patch?
On Wed, 3 Dec 2003, Gisle Aas wrote:

> [EMAIL PROTECTED] writes:
[...]
> > http://diveintomark.org/tests/client/http/200_gzip.xml
> >
> > IE just does it.
[...]
> Konqueror suggests saving or opening the file in an external app, but
> the file saved or given to an external app is still gzipped.

Not in KDE 3.2: it decompresses automatically, so when you save or
open with KWrite, it's just 200_gzip.xml.

John
Re: RFC: WWW::Mechanize::Compress or LWP patch?
On Wed, 3 Dec 2003, John J Lee wrote:
[...]
> Not in KDE 3.2: it decompresses automatically, so when you save or
> open with KWrite, it's just 200_gzip.xml.

...and I'd guess that's because Safari (Apple's browser based on
Konqueror) does the same, since KDE 3.2 apparently includes a lot of
changes merged back from Safari.

John
Re: RFC: WWW::Mechanize::Compress or LWP patch?
Don't you also need to adjust the content-length header to match the
new (uncompressed) content?

--- David Carter [EMAIL PROTECTED]

On Tue, 2 Dec 2003 13:37:06 -0400, Brian Cassidy wrote:

Hi All,

Today I cooked up a little bit of code [1] to give WWW::Mechanize the
ability to handle compressed content (gzip and deflate). I forwarded
it over to Andy for his comments and he thought that maybe it would be
best if this code was adapted for use directly in LWP. I would tend to
agree since everything is handled behind the scenes.

+ If Compress::Zlib isn't available, we forgo the Accept-Encoding header
+ It makes sure the response is compressed before trying to uncompress

The only (freak) edge case would be if you get a response that was
encoded and Compress::Zlib isn't available (thus it croak()s).

There could be considerable bandwidth savings if LWP users were able
to get compressed content by default (without even knowing it :).
Although I guess therein hides the problem where we force people to
accept compressed content.

Comments?

-Brian Cassidy ( [EMAIL PROTECTED] )

[1]

package WWW::Mechanize::Compress;

use strict;
use warnings FATAL => 'all';

use vars qw( $VERSION $HAS_ZLIB );
$VERSION = '0.01';

use base qw( WWW::Mechanize );
use Carp qw( carp croak );

BEGIN {
    $HAS_ZLIB = 1 if defined eval "require Compress::Zlib;";
}

sub _make_request {
    my $self    = shift;
    my $request = shift;

    $request->header( Accept_encoding => 'gzip; deflate' ) if $HAS_ZLIB;

    my $response = $self->SUPER::_make_request( $request, @_ );

    if ( my $encoding = $response->header( 'Content-Encoding' ) ) {
        croak 'Compress::Zlib not found. Cannot uncompress content.'
            unless $HAS_ZLIB;

        $self->{ uncompressed_content }
            = Compress::Zlib::memGunzip( $response->content )
            if $encoding =~ /gzip/i;
        $self->{ uncompressed_content }
            = Compress::Zlib::uncompress( $response->content )
            if $encoding =~ /deflate/i;
    }

    return $response;
}

sub content {
    my $self = shift;
    return $self->{ uncompressed_content } || $self->{ content };
}

1;
RE: RFC: WWW::Mechanize::Compress or LWP patch?
David,

In my WWW::Mech sub-class, I'm not actually modifying the response.
WWW::Mech extracts the content out into its own variable, thus I do
the same for the uncompressed content. If you happen to ask for
WWW::Mech's response object ($a->response()) then you'll get the
compressed data and you'll have to deal with that on your own.

I'm not really sure how you'd have to handle it in LWP at this point.
I'd prefer this be adapted in such a way that you wouldn't have to
mess with the original response data.

-Brian

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, December 02, 2003 1:57 PM
To: [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Subject: Re: RFC: WWW::Mechanize::Compress or LWP patch?

> Don't you also need to adjust the content-length header to match the
> new (uncompressed) content?
>
> --- David Carter [EMAIL PROTECTED]
RE: RFC: WWW::Mechanize::Compress or LWP patch?
Brian -

Had I read your code more closely, I would have seen this! I agree
with you that handling this at the LWP layer makes more sense than
handling it in Mechanize.

--- David Carter [EMAIL PROTECTED]

On Tue, 2 Dec 2003 14:07:43 -0400, Brian Cassidy wrote:

> In my WWW::Mech sub-class, I'm not actually modifying the response.
> WWW::Mech extracts the content out into its own variable, thus I do
> the same for the uncompressed content. If you happen to ask for
> WWW::Mech's response object ($a->response()) then you'll get the
> compressed data and you'll have to deal with that on your own.
>
> I'm not really sure how you'd have to handle it in LWP at this point.
> I'd prefer this be adapted in such a way that you wouldn't have to
> mess with the original response data.
Re: RFC: WWW::Mechanize::Compress or LWP patch?
> Brian - Had I read your code more closely, I would have seen this! I
> agree with you that handling this at the LWP layer makes more sense
> than handling it in Mechanize.

The more I'm thinking about it, it's not appropriate for LWP. As far
as I can think of, this would be the first and only instance of LWP
modifying content that it receives before passing it back to the
caller. I'm not sure that's a direction we should go.

Also, there needs to be something selectable for the users who happen
to have Compress::Zlib but don't want to get compressed data for
whatever reason.

xoa

--
Andy Lester => [EMAIL PROTECTED] => www.petdance.com => AIM:petdance
Re: RFC: WWW::Mechanize::Compress or LWP patch?
Andy Lester [EMAIL PROTECTED] writes:

> > Brian - Had I read your code more closely, I would have seen this!
> > I agree with you that handling this at the LWP layer makes more
> > sense than handling it in Mechanize.
>
> The more I'm thinking about it, it's not appropriate for LWP. As far
> as I can think of, this would be the first and only instance of LWP
> modifying content that it receives before passing it back to the
> caller. I'm not sure that's a direction we should go.

I agree. There are lots of questions to answer about what exactly
should happen in this case with regard to the various Content-*
headers. Content-Length is obvious, but there are also headers like
Content-MD5. How about partial content with Content-Range?

But it does make sense to have something in LWP that makes
decompression of content easy. It could for instance just be a library
of LWP-compatible content callback functions that do decoding on the
fly.

Also note that LWP already does transparent decompression at the
Transfer-Encoding level, so if you have a server that can compress
on-the-fly it should just work. This is different in that it does not
modify the actual payload of the message.

> Also, there needs to be something selectable for the users who happen
> to have Compress::Zlib but don't want to get compressed data for
> whatever reason.

It would certainly not happen by default. If you download a .tar.gz
file you don't want to end up with a .tar file just because you
happened to have Compress::Zlib installed.

--Gisle
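A "library of LWP-compatible content callback functions" could look
roughly like this. This is a hypothetical sketch only: the factory
name is made up, and it uses Compress::Raw::Zlib (a later split-out of
Compress::Zlib) to inflate gzip data chunk by chunk, the way data
arrives in an LWP content callback:

```perl
use strict;
use warnings;
use Compress::Raw::Zlib qw( WANT_GZIP Z_OK Z_STREAM_END );

# Hypothetical factory: returns an LWP-style content callback that
# inflates gzip-encoded chunks as they arrive and appends the decoded
# bytes to the scalar referenced by $out.
sub make_gunzip_callback {
    my ($out) = @_;
    my ( $inflater, $status ) =
        Compress::Raw::Zlib::Inflate->new( -WindowBits => WANT_GZIP );
    die "inflateInit failed: $status" unless $inflater;
    return sub {
        my ($chunk) = @_;    # LWP also passes ($response, $protocol)
        my $st = $inflater->inflate( \$chunk, \my $plain );
        die "inflate failed: $st"
            unless $st == Z_OK || $st == Z_STREAM_END;
        $$out .= $plain if defined $plain;
    };
}
```

Such a callback would be handed to $ua->request($request, $callback),
so the raw response stays untouched while the caller sees decoded data
as it streams in.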