RE: RFC: WWW::Mechanize::Compress or LWP patch?

2003-12-03 Thread Brian Cassidy
Gisle,

  The more I think about it, the more I'm convinced it's not appropriate
  for LWP.  As far as I can tell, this would be the first and only
  instance of LWP modifying content that it receives before passing it
  back to the caller.  I'm not sure that's a direction we should go.
 
 I agree.  There are lots of questions to answer about exactly what
 should happen in this case with regard to various Content-* headers.
 Content-Length is obvious, but there are also headers like
 Content-MD5.  How about partial content with Content-Range?

Like I said in reply to David's message, I would hope that we wouldn't have
to modify the original response at all.

  Also, there needs to be something selectable for the users who happen to
  have Compress::Zlib but don't want to get compressed data for whatever
  reason.
 
 It would certainly not happen by default.  If you download a .tar.gz
 file you don't want to end up with a .tar file just because you
 happened to have Compress::Zlib installed.

Okay, but wouldn't the tar.gz file be a tar.gz.gz file if it were
content-encoded (gzip)? Thus uncompressing it would give you back the tar.gz
file.

It was always my understanding that if content-encoding has a value, then
doing the reverse will give you back the original file, regardless of any
original compression.

-Brian

PS: (Whoops, I originally only sent this to Gisle)


http://www.gordano.com - Messaging for educators.


Re: RFC: WWW::Mechanize::Compress or LWP patch?

2003-12-03 Thread Gisle Aas
---BeginMessage---
Brian Cassidy [EMAIL PROTECTED] writes:

   Also, there needs to be something selectable for the users who happen to
   have Compress::Zlib but don't want to get compressed data for whatever
   reason.
  
  It would certainly not happen by default.  If you download a .tar.gz
  file you don't want to end up with a .tar file just because you
  happened to have Compress::Zlib installed.
 
 Okay, but wouldn't the tar.gz file be a tar.gz.gz file if it were
 content-encoded (gzip)? Thus uncompressing it would give you back the tar.gz
 file.

No.  A server does the right thing if it marks a gzipped tar file
with Content-Encoding: gzip.  Some examples:

[EMAIL PROTECTED] gisle]$ HEAD http://cpan.org/src/latest.tar.gz
200 OK
Connection: close
Date: Tue, 02 Dec 2003 20:00:08 GMT
Accept-Ranges: bytes
ETag: 94cf4-b585df-adad8740
Server: Apache/2.0.47 (Unix) DAV/2
Content-Encoding: x-gzip
Content-Length: 11896287
Content-Type: application/x-tar
Last-Modified: Wed, 05 Nov 2003 23:36:21 GMT
Client-Date: Tue, 02 Dec 2003 20:00:08 GMT
Client-Peer: 63.251.223.172:80
Client-Response-Num: 1

[EMAIL PROTECTED] gisle]$ HEAD ftp://ftp.cpan.org/pub/CPAN/src/latest.tar.gz
200 OK
Server: (vsFTPd 1.1.3)
Content-Encoding: gzip
Content-Length: 11896287
Content-Type: application/x-tar
Last-Modified: Wed, 05 Nov 2003 23:36:21 GMT
Client-Date: Tue, 02 Dec 2003 20:00:54 GMT
Client-Request-Num: 1

 It was always my understanding that if content-encoding has a value, then
 doing the reverse will give you back the original file, regardless of any
 original compression.

In HTTP there is not really any concept of an original file.  There is
just an entity (aka content) that is described with various Content-*
headers.  A Content-Encoding header just says that there is some
transform needed on the content before you obtain the media type
denoted by the Content-Type header.
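
To make that ordering concrete, here is an untested sketch (it assumes
Compress::Zlib is installed and that $response is an HTTP::Response;
only gzip is handled):

  use Compress::Zlib ();

  my $body = $response->content;
  my $enc  = $response->header('Content-Encoding') || '';

  if ($enc =~ /gzip/i) {
      # memGunzip() returns undef if the data is not valid gzip
      defined($body = Compress::Zlib::memGunzip($body))
          or die "content was not valid gzip data";
  }

  # Only now does the Content-Type header (application/x-tar in the
  # examples above) describe what is in $body.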

Regards,
Gisle

---End Message---


RE: RFC: WWW::Mechanize::Compress or LWP patch?

2003-12-03 Thread Brian Cassidy
Gisle,

Doh. Way to ruin my day! :)

So, are there any proposed work-arounds, or is this completely useless now?

-Brian

 -Original Message-

 No.  A server does the right thing if it marks a gzipped tar file
 with Content-Encoding: gzip.  Some examples:


http://www.gordano.com - Messaging for educators.


Re: RFC: WWW::Mechanize::Compress or LWP patch?

2003-12-03 Thread Gisle Aas
David Carter [EMAIL PROTECTED] writes:

 My understanding had always been that content-encoding (when talking about
 compression) is in practical terms no different than transfer-encoding. LWP
 already handles transfer-encoding (gzip or deflate), so what's the big deal
 about it also handling content-encoding compression in a transparent manner?

Transfer-Encoding and Content-Encoding work at different levels of the
HTTP protocol.  It makes perfect sense to handle Transfer-Encoding
transparently in a client library.  It does not make sense to try to
hide Content-Encoding in the same way.

 My suggestion would be to make it the default to handle it transparently,
 but provide an option to turn it off if someone needs access to the raw
 datastream. All GUI browsers just do it - the user doesn't have to be
 concerned with either content-encoding or transfer-encoding.

I disagree.  LWP is not a GUI browser and should not hide
content-encoding by default.

 If you have a file in .tar.gz format, the web server should NOT return a
 content-encoding: gzip header.

Sure it should.  Especially if the Content-Type header describes the
type of document you end up with after you 'gunzip' it.

  This would incur redundant processing costs
 on the server and the client, attempting to re-compress an already compressed
 file for little or no gain. Instead, the server would send an appropriate
 mime type indicating to the client that this is a compressed archive file
 (usually handled in a GUI client by presenting a file download dialog box).

I disagree here, but I'm sure practice differs among servers.  Apache
seems to serve .tar.gz files as:

   Content-Type: application/x-tar
   Content-Encoding: x-gzip

and I think that is exactly as it should be.

 It may not be what the RFCs originally intended, but modern web server
 implementations of on-the-fly compression in my experience always use
 content-encoding rather than transfer-encoding.

Could it have something to do with what MSIE implements?

  I've written a server-side
 plug-in to do this on the Netscape/iPlanet web server, and have done fairly
 extensive research on what's out there in Apache, etc. 

I'm not opposed to adding stuff to LWP that lets you undo
Content-Encoding, but it needs to be enabled explicitly to make it
backwards compatible.

LWP currently has code that tries to parse the head section of
text/html documents to extract headers, meta and the base.  This code
fails when the document is compressed, so there is actually a need for
undo-content-encoding support in the LWP core.

I think most users would be served well by an option that simply
tells LWP to try to undo content-encoding for any text/* content, but
I'm also thinking that LWP should have some kind of generic filtering
mechanism similar to Perl's IO layers.  That should be able to deal
with content-encoding and might even turn the content into Unicode
strings and the like, based on the charset parameter.
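
As a rough illustration of the opt-in style I mean (decode_content()
here is a made-up helper for illustration, not an existing LWP method):

  # Hypothetical, opt-in helper -- never runs by default.
  sub decode_content {
      my $response = shift;
      my $enc = $response->header('Content-Encoding')
          or return $response->content;

      require Compress::Zlib;
      my $body = $response->content;
      $body = Compress::Zlib::memGunzip($body)  if $enc =~ /gzip/i;
      $body = Compress::Zlib::uncompress($body) if $enc =~ /deflate/i;
      return $body;
  }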

Regards,
Gisle


Re: RFC: WWW::Mechanize::Compress or LWP patch?

2003-12-03 Thread Gisle Aas
Brian Cassidy [EMAIL PROTECTED] writes:

 So, are there any proposed work-arounds, or is this completely useless now?

The decompressor just needs to be smarter about when to kick in.  For
a GUI browser it would kick in if you decide to display the document.
This can be determined after looking at the content-type, but I'm sure
they do all kinds of stuff to try to second-guess the server (like
looking at the URL suffix and peeking at the first block of the
document content).  For WWW::Mechanize you could for instance try to
automatically undo content-encoding once code actually tries to match
against content, parse forms, etc.
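
Roughly like this, say (an untested sketch; it assumes Compress::Zlib
is already loaded and only bothers with gzip):

  # Hypothetical lazy variant of the subclass earlier in this thread:
  # leave the response untouched, decode only when the content is used.
  sub content {
      my $self = shift;
      unless ( defined $self->{ uncompressed_content } ) {
          my $res = $self->response;
          my $enc = $res->header('Content-Encoding') || '';
          $self->{ uncompressed_content } = $enc =~ /gzip/i
              ? Compress::Zlib::memGunzip( $res->content )
              : $res->content;
      }
      return $self->{ uncompressed_content };
  }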

Regards,
Gisle


Re: RFC: WWW::Mechanize::Compress or LWP patch?

2003-12-03 Thread david
From the redirects thread, also on this list, I find the browser test
suite site Brian mentioned to be relevant to this discussion.

Shouldn't LWP be able to handle this URL automagically?
(Whether by default or not is, as always, open to
discussion.)

http://diveintomark.org/tests/client/http/200_gzip.xml

IE just does it.

Regards,
David


On 03 Dec 2003 05:45:19 -0800, Gisle Aas wrote:

 Brian Cassidy [EMAIL PROTECTED] writes:
 
  So, are there any proposed work-arounds, or is this completely
  useless now?
 
 The decompressor just needs to be smarter about when to kick in.  For
 a GUI browser it would kick in if you decide to display the document.
 This can be determined after looking at the content-type, but I'm sure
 they do all kinds of stuff to try to second-guess the server (like
 looking at the URL suffix and peeking at the first block of the
 document content).  For WWW::Mechanize you could for instance try to
 automatically undo content-encoding once code actually tries to match
 against content, parse forms, etc.
 
 Regards,
 Gisle


Re: RFC: WWW::Mechanize::Compress or LWP patch?

2003-12-03 Thread Gisle Aas
[EMAIL PROTECTED] writes:

 Shouldn't LWP be able to handle this URL automagically?

Yes, if we can find an API everybody is happy with.

 http://diveintomark.org/tests/client/http/200_gzip.xml
 
 IE just does it.

So does Mozilla.  Konqueror suggests saving or opening the file in an
external app, but the file saved or given to an external app is still
gzipped.

Regards,
Gisle


Re: RFC: WWW::Mechanize::Compress or LWP patch?

2003-12-03 Thread John J Lee
On Wed, 3 Dec 2003, Gisle Aas wrote:
 [EMAIL PROTECTED] writes:
[...]
  http://diveintomark.org/tests/client/http/200_gzip.xml
 
  IE just does it.

[...]
 Konqueror suggests saving or opening the file in an
 external app, but the file saved or given to an external app is still
 gzipped.

Not in KDE 3.2: it decompresses automatically, so when you save or open
with KWrite, it's just 200_gzip.xml.


John


Re: RFC: WWW::Mechanize::Compress or LWP patch?

2003-12-03 Thread John J Lee
On Wed, 3 Dec 2003, John J Lee wrote:
[...]
 Not in KDE 3.2: it decompresses automatically, so when you save or open
 with KWrite, it's just 200_gzip.xml.

...and I'd take a guess that's because Safari (Apple's browser, based on
Konqueror) does the same, since 3.2 apparently includes a lot of changes
merged back from Safari.


John


Re: RFC: WWW::Mechanize::Compress or LWP patch?

2003-12-02 Thread david
Don't you also need to adjust the content-length header
to match the new (uncompressed) content? 

---
David Carter
[EMAIL PROTECTED]




On Tue, 2 Dec 2003 13:37:06 -0400, Brian Cassidy wrote:

 Hi All,
 
 Today I cooked up a little bit of code [1] to give WWW::Mechanize the
 ability to handle compressed content (gzip and deflate). I forwarded it
 over to Andy for his comments and he thought that maybe it would be best
 if this code was adapted for use directly in LWP.
 
 I would tend to agree since everything is handled behind the scenes.
 + If Compress::Zlib isn't available, we forgo the Accept_encoding headers
 + It makes sure the response is compressed before trying to uncompress
 
 The only (freak) edge case would be if you get a response that was
 encoded and Compress::Zlib isn't available (thus it croak()s).
 
 There could be considerable bandwidth savings if LWP users were able to
 get compressed content by default (without even knowing it :). Although
 I guess therein hides the problem where we force people to accept
 compressed content.
 
 Comments?
 
 -Brian Cassidy ( [EMAIL PROTECTED] )
 
 [1]
 
 package WWW::Mechanize::Compress;
 
 use strict;
 use warnings FATAL => 'all';
 use vars qw( $VERSION $HAS_ZLIB );
 $VERSION = '0.01';
 
 use base qw( WWW::Mechanize );
 use Carp qw( carp croak );
 
 BEGIN {
   # Remember whether Compress::Zlib is available, without dying if not.
   $HAS_ZLIB = 1 if defined eval "require Compress::Zlib;";
 }
 
 sub _make_request {
   my $self    = shift;
   my $request = shift;
 
   # Only advertise compression if we can actually decompress.
   # (Accept-Encoding values are comma-separated: "gzip, deflate".)
   $request->header( Accept_encoding => 'gzip, deflate' ) if $HAS_ZLIB;
 
   my $response = $self->SUPER::_make_request( $request, @_ );
 
   if ( my $encoding = $response->header( 'Content-Encoding' ) ) {
     croak 'Compress::Zlib not found. Cannot uncompress content.'
       unless $HAS_ZLIB;
 
     $self->{ uncompressed_content } =
       Compress::Zlib::memGunzip( $response->content )
         if $encoding =~ /gzip/i;
     $self->{ uncompressed_content } =
       Compress::Zlib::uncompress( $response->content )
         if $encoding =~ /deflate/i;
   }
 
   return $response;
 }
 
 sub content {
   my $self = shift;
 
   return $self->{ uncompressed_content } || $self->{ content };
 }
 
 1;
 
 
 http://www.gordano.com - Messaging for educators.


RE: RFC: WWW::Mechanize::Compress or LWP patch?

2003-12-02 Thread Brian Cassidy
David,

In my WWW::Mech sub-class, I'm not actually modifying the response.

WWW::Mech extracts the content out into its own variable, thus I do the same
for the uncompressed content. If you happen to ask for WWW::Mech's response
object ($a->response()) then you'll get the compressed data and you'll have
to deal with that on your own.
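
For example (untested; the URL is just a placeholder):

  my $mech = WWW::Mechanize::Compress->new;
  $mech->get( 'http://example.com/' );

  my $html = $mech->content;            # uncompressed, ready to work with
  my $raw  = $mech->response->content;  # still exactly as the server sent it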

I'm not really sure how you'd have to handle it in LWP at this point. I'd
prefer this be adapted in such a way that you wouldn't have to mess with the
original response data.

-Brian 

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, December 02, 2003 1:57 PM
To: [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Subject: Re: RFC: WWW::Mechanize::Compress or LWP patch?

Don't you also need to adjust the content-length header
to match the new (uncompressed) content? 

---
David Carter
[EMAIL PROTECTED]






http://www.gordano.com - Messaging for educators.


RE: RFC: WWW::Mechanize::Compress or LWP patch?

2003-12-02 Thread david
Brian - Had I read your code more closely, I would have
seen this! I agree with you that handling this at the
LWP layer makes more sense than handling it in
Mechanize. 

---
David Carter
[EMAIL PROTECTED]

On Tue, 2 Dec 2003 14:07:43 -0400, Brian Cassidy wrote:
 In my WWW::Mech sub-class, I'm not actually modifying the response.
 
 WWW::Mech extracts the content out into its own variable, thus I do the
 same for the uncompressed content. If you happen to ask for WWW::Mech's
 response object ($a->response()) then you'll get the compressed data and
 you'll have to deal with that on your own.
 
 I'm not really sure how you'd have to handle it in LWP at this point. I'd
 prefer this be adapted in such a way that you wouldn't have to mess with
 the original response data.


Re: RFC: WWW::Mechanize::Compress or LWP patch?

2003-12-02 Thread Andy Lester
 Brian - Had I read your code more closely, I would have
 seen this! I agree with you that handling this at the
 LWP layer makes more sense than handling it in
 Mechanize. 

The more I think about it, the more I'm convinced it's not appropriate
for LWP.  As far as I can tell, this would be the first and only
instance of LWP modifying content that it receives before passing it
back to the caller.  I'm not sure that's a direction we should go.

Also, there needs to be something selectable for the users who happen to
have Compress::Zlib but don't want to get compressed data for whatever
reason.

xoa

-- 
Andy Lester = [EMAIL PROTECTED] = www.petdance.com = AIM:petdance


Re: RFC: WWW::Mechanize::Compress or LWP patch?

2003-12-02 Thread Gisle Aas
Andy Lester [EMAIL PROTECTED] writes:

  Brian - Had I read your code more closely, I would have
  seen this! I agree with you that handling this at the
  LWP layer makes more sense than handling it in
  Mechanize. 
 
 The more I think about it, the more I'm convinced it's not appropriate
 for LWP.  As far as I can tell, this would be the first and only
 instance of LWP modifying content that it receives before passing it
 back to the caller.  I'm not sure that's a direction we should go.

I agree.  There are lots of questions to answer about exactly what
should happen in this case with regard to various Content-* headers.
Content-Length is obvious, but there are also headers like
Content-MD5.  How about partial content with Content-Range?

But it does make sense to have something in LWP that makes
decompression of content easy.  It could for instance just be a library
of LWP-compatible content callback functions that do decoding on the
fly.
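
For instance, something along these lines (a crude, untested sketch
that buffers the whole body and decodes at the end rather than truly
streaming; the URL is a placeholder):

  use LWP::UserAgent;
  use HTTP::Request;
  use Compress::Zlib ();

  my $ua  = LWP::UserAgent->new;
  my $req = HTTP::Request->new( GET => 'http://example.com/doc.html' );
  $req->header( 'Accept-Encoding' => 'gzip' );

  # Collect the raw body through the existing content-callback interface
  my $raw = '';
  my $res = $ua->request( $req, sub { $raw .= $_[0] } );

  # ...then undo the Content-Encoding only if the server applied one
  my $body = ( $res->header('Content-Encoding') || '' ) =~ /gzip/i
      ? Compress::Zlib::memGunzip($raw)
      : $raw;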

Also note that LWP already does transparent decompression at the
Transfer-Encoding level, so if you have a server that can compress
on the fly it should just work.  This is different in that it does not
modify the actual payload of the message.

 Also, there needs to be something selectable for the users who happen to
 have Compress::Zlib but don't want to get compressed data for whatever
 reason.

It would certainly not happen by default.  If you download a .tar.gz
file you don't want to end up with a .tar file just because you
happened to have Compress::Zlib installed.

--Gisle