Re: Content disposition question

2007-12-10 Thread Hrvoje Niksic
Micah Cowan [EMAIL PROTECTED] writes:

 Actually, the reason it is not enabled by default is that (1) it is
 broken in some respects that need addressing, and (2) as it is currently
 implemented, it involves a significant amount of extra traffic,
 regardless of whether the remote end actually ends up using
 Content-Disposition somewhere.

I'm curious, why is this the case?  I thought the code was refactored
to determine the file name after the headers arrive.  It certainly
looks that way by the output it prints:

{mulj}[~]$ wget www.cnn.com
[...]
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: `index.html'   # note: "Saving to" appears only after the HTTP response

Where does the extra traffic come from?

 Note that it is not available at all in any release version of Wget;
 only in the current development versions. We will be releasing Wget 1.11
 very shortly, which will include the --content-disposition
 functionality; however, this functionality is EXPERIMENTAL only. It
 doesn't quite behave properly, and needs some severe adjustments before
 it is appropriate to leave as default.

If it is not ready for general use, we should consider removing it
from NEWS.  If not, it should be properly documented in the manual.  I
am aware that the NEWS entry claims that the feature is experimental,
but why even mention it if it's not ready for general consumption?
Announcing experimental features in NEWS is a good way to make testers
aware of them during the alpha/beta release cycle, but it should be
avoided in production releases of mature software.

 As to breaking old scripts, I'm not really concerned about that (and
 people who read the NEWS file, as anyone relying on previous
 behaviors for Wget should do, would just need to set
 --no-content-disposition, when the time comes that we enable it by
 default).

Agreed.


Re: Content disposition question

2007-12-10 Thread Micah Cowan
Hrvoje Niksic wrote:
 Micah Cowan [EMAIL PROTECTED] writes:
 
 Actually, the reason it is not enabled by default is that (1) it is
 broken in some respects that need addressing, and (2) as it is currently
 implemented, it involves a significant amount of extra traffic,
 regardless of whether the remote end actually ends up using
 Content-Disposition somewhere.
 
 I'm curious, why is this the case?  I thought the code was refactored
 to determine the file name after the headers arrive.  It certainly
 looks that way by the output it prints:
 
 {mulj}[~]$ wget www.cnn.com
 [...]
 HTTP request sent, awaiting response... 200 OK
 Length: unspecified [text/html]
 Saving to: `index.html'   # note: "Saving to" appears only after the HTTP response
 
 Where does the extra traffic come from?

Your example above doesn't set --content-disposition; if you do, there
is an extra HEAD request sent.

As to why this is the case, I believe it was so that we could properly
handle accepts/rejects, whereas we will otherwise usually assume that we
can match accept/reject against the URL itself (we currently do this
improperly for the -nd -r case, still matching using the generated
file name's suffix).

Beyond that, I'm not sure as to why, and it's my intention that it not
be done in 1.12. Removing it for 1.11 is too much trouble, as the
sending-HEAD and sending-GET code is not nearly decoupled enough to do it
without risk (and indeed, we were seeing trouble where every time we
fixed an issue with the send-HEAD-first logic, something else would
break). I want to do some reworking of gethttp and http_loop before I
will feel comfortable in changing how they work.

 If it is not ready for general use, we should consider removing it
 from NEWS.

I had thought of that. The thing that has kept me from it so far is that
it is a feature that is desired by many people, and for most of them,
it will work (the issues are pretty minor, and mainly corner-case,
except perhaps for the fact that files are apparently always downloaded
to the top directory, and not the one in which the URL was found).

And, if we leave it out of NEWS and documentation, then, when we answer
people who ask "How can I get Wget to respect Content-Disposition
headers?", the natural follow-up will be, "Why isn't this mentioned
anywhere in the documentation?". :)

 If not, it should be properly documented in the manual.

Yes... I should be more specific about its shortcomings.

 I am aware that the NEWS entry claims that the feature is experimental,
 but why even mention it if it's not ready for general consumption?
 Announcing experimental features in NEWS is a good way to make testers
 aware of them during the alpha/beta release cycle, but it should be
 avoided in production releases of mature software.

It's pretty much good enough; it's not where I want it, but it _is_
usable. The extra traffic is really the main reason I don't want it
on-by-default.

- --
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer...
http://micah.cowan.name/


Re: Content disposition question

2007-12-10 Thread Hrvoje Niksic
Micah Cowan [EMAIL PROTECTED] writes:

 I thought the code was refactored to determine the file name after
 the headers arrive.  It certainly looks that way by the output it
 prints:
 
 {mulj}[~]$ wget www.cnn.com
 [...]
 HTTP request sent, awaiting response... 200 OK
 Length: unspecified [text/html]
 Saving to: `index.html'   # note: "Saving to" appears only after the HTTP response
 
 Where does the extra traffic come from?

 Your example above doesn't set --content-disposition;

I'm aware of that, but the above example was supposed to point out the
refactoring that has already taken place, regardless of whether
--content-disposition is specified.  As shown above, Wget always waits
for the headers before determining the file name.  If that is the
case, it would appear that no additional traffic is needed to get
Content-Disposition; Wget simply needs to use the information already
received.

 As to why this is the case, I believe it was so that we could
 properly handle accepts/rejects,

Issuing another request seems to be the wrong way to go about it, but
I haven't thought about it hard enough, so I could be missing a lot of
subtleties.

 I am aware that the NEWS entry claims that the feature is experimental,
 but why even mention it if it's not ready for general consumption?
 Announcing experimental features in NEWS is a good way to make testers
 aware of them during the alpha/beta release cycle, but it should be
 avoided in production releases of mature software.

 It's pretty much good enough; it's not where I want it, but it
 _is_ usable. The extra traffic is really the main reason I don't
 want it on-by-default.

It should IMHO be documented, then.  Even if it's documented as
experimental.
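
A minimal sketch of the idea argued above, namely that the file name can be
chosen from the headers of the one GET response, with no preliminary HEAD
request. It is written in Python rather than Wget's C and is not Wget's actual
code. The function name pick_filename and the plain-dict headers argument are
hypothetical, chosen only for illustration; the standard-library
email.message.Message class is used here just to parse the header parameters.

```python
from email.message import Message
from urllib.parse import urlsplit, unquote
import posixpath


def pick_filename(url, headers, default="index.html"):
    """Choose a local file name once the response headers are available.

    `headers` is assumed to be a plain dict of the headers from the single
    GET response, with an exact "Content-Disposition" key (hypothetical
    interface, for illustration only).
    """
    disposition = headers.get("Content-Disposition")
    if disposition:
        msg = Message()
        msg["Content-Disposition"] = disposition
        name = msg.get_filename()  # None when no filename parameter is present
        if name:
            # Keep only the last path component so that a server-supplied
            # name cannot point outside the download directory.
            name = posixpath.basename(name.replace("\\", "/"))
            if name not in ("", ".", ".."):
                return name
    # No usable header: fall back to the last component of the URL path.
    name = posixpath.basename(unquote(urlsplit(url).path))
    return name or default


if __name__ == "__main__":
    url = "http://www.mininova.org/get/946879"
    hdrs = {
        "Content-Disposition":
            'attachment; filename="-{mininova.org}- ubuntu-7.10-desktop-i386.iso.torrent"'
    }
    print(pick_filename(url, hdrs))
    print(pick_filename(url, {}))
    print(pick_filename("http://www.cnn.com", {}))
```

This only covers naming; it does not address the accept/reject matching
mentioned earlier in the thread, which may be the reason the extra request
was introduced.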


Content disposition question

2007-12-03 Thread Vladimir Niksic
Hi!

I have noticed that wget doesn't automatically use the option 
'--content-disposition'. So what happens is when you download something
from a site that uses content disposition, the resulting file on the
filesystem is not what it should be.

For example, when downloading an Ubuntu torrent from mininova I get:

{uragan}[~/tmp]$ wget http://www.mininova.org/get/946879
--2007-12-03 15:58:46--  http://www.mininova.org/get/946879
Resolving www.mininova.org... 87.233.147.140
Connecting to www.mininova.org|87.233.147.140|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 28064 (27K) [application/x-bittorrent]
Saving to: `946879'

100%[]
28,064  87.0K/s   in 0.3s

2007-12-03 15:58:47 (87.0 KB/s) - `946879' saved [28064/28064]

When I use the option --content-disposition:

{uragan}[~/tmp]$ wget --content-disposition http://www.mininova.org/get/946879
--2007-12-03 15:59:18--  http://www.mininova.org/get/946879
Resolving www.mininova.org... 87.233.147.140
Connecting to www.mininova.org|87.233.147.140|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 0 [application/x-bittorrent]
--2007-12-03 15:59:18--  http://www.mininova.org/get/946879
Connecting to www.mininova.org|87.233.147.140|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 28064 (27K) [application/x-bittorrent]
Saving to: `-{mininova.org}- ubuntu-7.10-desktop-i386.iso.torrent'

100%[]
28,064  47.8K/s   in 0.6s

2007-12-03 15:59:19 (47.8 KB/s) - `-{mininova.org}- ubuntu-7.10-desktop-i386.iso.torrent' saved [28064/28064]


I realize that I could put this option in .wgetrc, but I think that it
would be better if this was the default because the majority of users are
unaware of this option, and cannot hope to find it unless acquainted
with the inner mechanics of HTTP. Also, it's nearly impossible to find.
I've been googling it and finally managed to dig it up from the
documentation.



Re: Content disposition question

2007-12-03 Thread Matthias Vill
Hi,

We know this. This was just recently discussed on the mailing list and I
agree with you.
But there are two arguments why this is not the default:
a) It's quite a new feature for wget and therefore would break
compatibility with prior versions, and any old script would need to be
rewritten.
b) It's impossible to pre-guess the filename and thus it is not so well
suited for script-usage.

I would like to have this feature enabled by some --interactive switch
(which could include more options and might be easier to find) or, as you
suggested, as the default with a disable switch.

Greetings

Matthias

Vladimir Niksic wrote:
 I have noticed that wget doesn't automatically use the option 
 '--content-disposition'. So what happens is when you download something
 from a site that uses content disposition, the resulting file on the
 filesystem is not what it should be.

 I realize that I could put this option in .wgetrc, but I think that it
 would be better if this was the default because the majority of users are
 unaware of this option, and cannot hope to find it unless acquainted
 with the inner mechanics of HTTP. Also, it's nearly impossible to find.
 I've been googling it and finally managed to dig it up from the
 documentation.
 


Re: Content disposition question

2007-12-03 Thread Micah Cowan
Matthias Vill wrote:
 Hi,
 
 We know this. This was just recently discussed on the mailing list and I
 agree with you.
 But there are two arguments why this is not the default:
 a) It's quite a new feature for wget and therefore would break
 compatibility with prior versions, and any old script would need to be
 rewritten.
 b) It's impossible to pre-guess the filename and thus it is not so well
 suited for script-usage.
 
 I would like to have this feature enabled by some --interactive switch
 (which could include more options and might be easier to find) or, as you
 suggested, as the default with a disable switch.

Actually, the reason it is not enabled by default is that (1) it is
broken in some respects that need addressing, and (2) as it is currently
implemented, it involves a significant amount of extra traffic,
regardless of whether the remote end actually ends up using
Content-Disposition somewhere.

Note that it is not available at all in any release version of Wget;
only in the current development versions. We will be releasing Wget 1.11
very shortly, which will include the --content-disposition
functionality; however, this functionality is EXPERIMENTAL only. It
doesn't quite behave properly, and needs some severe adjustments before
it is appropriate to leave as default.

As to breaking old scripts, I'm not really concerned about that (and
people who read the NEWS file, as anyone relying on previous behaviors
for Wget should do, would just need to set --no-content-disposition,
when the time comes that we enable it by default).

- --
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer...
http://micah.cowan.name/