Re: wget re-download fully downloaded files

2008-10-27 Thread Micah Cowan

Maksim Ivanov wrote:
 I'm trying to download the same file from the same server, command line
 I use:
 wget --debug -o log  -c -t 0 --load-cookies=cookie_file
 http://rapidshare.com/files/153131390/Blind-Test.rar
 
 Below attached 2 files: log with 1.9.1 and log with 1.10.2
 Both logs are made when Blind-Test.rar was already on my HDD.
 Sorry for some mess in logs, but russian language used on my console.

Thanks very much for providing these, Maksim; they were very helpful.
(Sorry for getting back to you so late: it's been busy lately).

I've confirmed this behavioral difference (though I compared the current
development sources against 1.8.2, rather than 1.10.2 to 1.9.1). Your
logs involve a 302 redirection before arriving at the real file, but
that's just a red herring.

The difference is that when 1.9.1 encountered a server that would
respond to a byte-range request with 200 (meaning it doesn't know how
to send partial contents), but with a Content-Length value matching the
size of the local file, then wget would close the connection and not
proceed to redownload. 1.10.2, on the other hand, would just re-download it.
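
A minimal sketch of the check described above, in Python rather than Wget's C. This is not Wget's source: the function name and the returned labels are hypothetical, and the fallback case is an assumption made only to keep the example complete.

```python
def continue_decision(status, content_length, local_size):
    """Decide how to handle the reply to a byte-range (-c) request.

    Hypothetical illustration of the 1.9.1 behavior described in the
    message; not taken from Wget's code.
    """
    if status == 206:
        # The server honored the range request: append the missing bytes.
        return "append"
    if status == 200 and content_length is not None \
            and content_length == local_size:
        # The server ignored the range, but the advertised size equals
        # what is already on disk: treat the file as complete and stop.
        return "skip"
    # Assumption: in every other case the full body has to be fetched.
    return "restart"


if __name__ == "__main__":
    print(continue_decision(200, 1000, 1000))
```

Under these assumptions, a 200 reply whose Content-Length equals the local file size yields "skip", which is the case the logs exercise.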

Actually, I'll have to confirm this, but I think that current Wget will
re-download it, but not overwrite the current content, until it arrives
at some content corresponding to bytes beyond the current content.

I need to investigate further to see if this change was somehow
intentional (though I can't imagine what the reasoning would be); if I
don't find a good reason not to, I'll revert this behavior. Probably for
the 1.12 release, but I might possibly punt it to 1.13 on the grounds
that it's not a recent regression (however, it should really be a quick
fix, so most likely it'll be in for 1.12).

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
GNU Maintainer: wget, screen, teseq
http://micah.cowan.name/


Re: --mirror and --cut-dirs=2 bug?

2008-10-27 Thread Micah Cowan

Brock Murch wrote:
 I try to keep a mirror of NASA atteph ancillary data for modis processing. I 
 know that means little, but I have a cron script that runs 2 times a day. 
 Sometimes it works, and others, not so much. The sh script is listed at the 
 end of this email below, as are the contents of the remote ftp server's root 
 and portions of the log. 
 
 I don't need all the data on the remote server, only some; thus I use 
 --cut-dirs. To make matters stranger, the software (also from NASA) that uses 
 these files, looks for them in a single place on the client machine where the 
 software runs, but needs data from 2 different directories on the remote ftp 
 server. If the data is not on the client machine, the software kindly ftp's 
 the files to the local directory. However, I don't allow write access to that 
 directory as many people use the software and when it is d/l'ed it has the 
 wrong perms for others to use it, thus I mirror the data I need from the ftp 
 site locally. In the script below, there are 2 wget commands, but they are to 
 slightly different directories (MODISA & MODIST).

I wouldn't recommend that. Using the same output directory for two
different source directories seems likely to lead to problems. You'd
most likely be better off by pulling to two locations, and then
combining them afterwards.

I don't know for sure that it _will_ cause problems (except if they
happen to have same-named files), as long as .listing files are being
properly removed (there were some recently-fixed bugs related to that, I
think? ...just appending new listings on top of existing files).

 It appears to me that the problem occurs if there is a ftp server error, and 
 wget starts a retry. wget goes to the server root, gets the .listing from 
 there for some reason (as opposed to the directory it should go to on the 
 server), and then goes to the dir it needs to mirror and can't find the files 
 (that are listed in the root dir) and creates dirs, and then I get "No such 
 file" errors and recursive directories created. Any advice would be 
 appreciated.

This snippet seems to be the source of the problem:

 Error in server response, closing control connection.
 Retrying.
 
 --14:53:53--  ftp://oceans.gsfc.nasa.gov/MODIST/ATTEPH/2002/110/
   (try: 2) => `/home1/software/modis/atteph/2002/110/.listing'
 Connecting to oceans.gsfc.nasa.gov|169.154.128.45|:21... connected.
 Logging in as anonymous ... Logged in!
 ==> SYST ... done.    ==> PWD ... done.
 ==> TYPE I ... done.  ==> CWD not required.
 ==> PASV ... done.    ==> LIST ... done.

That "CWD not required" bit is erroneous. I'm 90% sure we fixed this
issue recently (though I'm not 100% sure that the fix made it into a
release; I believe so).

I believe we made some related fixes more recently. You provided a great
amount of useful information, but one thing that seems to be missing (or
I missed it) is the Wget version number. Judging from the log, I'd say
it's 1.10.2 or older; the most recent version of Wget is 1.11.4; could
you please try to verify whether Wget continues to exhibit this problem
in the latest release version?

I'll also try to look into this as I have time (but it might be a while
before I can give it some serious attention; it'd be very helpful if you
could do a little more legwork).

--
Thanks very much,
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
GNU Maintainer: wget, screen, teseq
http://micah.cowan.name/


Re: --mirror and --cut-dirs=2 bug?

2008-10-27 Thread Micah Cowan

Micah Cowan wrote:
 I believe we made some related fixes more recently. You provided a great
 amount of useful information, but one thing that seems to be missing (or
 I missed it) is the Wget version number. Judging from the log, I'd say
 it's 1.10.2 or older; the most recent version of Wget is 1.11.4; could
 you please try to verify whether Wget continues to exhibit this problem
 in the latest release version?

This problem looks like the one that Mike Grant fixed in October of
2006: http://hg.addictivecode.org/wget/1.11/rev/161aa64e7e8f, so it
should definitely be fixed in 1.11.4. Please let me know if it isn't.

--
Regards,
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
GNU Maintainer: wget, screen, teseq
http://micah.cowan.name/


Re: --mirror and --cut-dirs=2 bug?

2008-10-27 Thread Brock Murch
Micah,

Thanks for your quick attention to this. Yes, I probably forgot to include 
the version #:

[EMAIL PROTECTED] atteph]# wget --version
GNU Wget 1.10.2 (Red Hat modified)

Copyright (C) 2005 Free Software Foundation, Inc.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

Originally written by Hrvoje Niksic [EMAIL PROTECTED].

I will see if I can get the newest version for:
[EMAIL PROTECTED] atteph]# cat /etc/redhat-release
CentOS release 4.2 (Final)

I'll let you know how that goes.

Brock

On Monday 27 October 2008 2:19 pm, Micah Cowan wrote:
 Micah Cowan wrote:
  I believe we made some related fixes more recently. You provided a great
  amount of useful information, but one thing that seems to be missing (or
  I missed it) is the Wget version number. Judging from the log, I'd say
  it's 1.10.2 or older; the most recent version of Wget is 1.11.4; could
  you please try to verify whether Wget continues to exhibit this problem
  in the latest release version?

 This problem looks like the one that Mike Grant fixed in October of
 2006: http://hg.addictivecode.org/wget/1.11/rev/161aa64e7e8f, so it
 should definitely be fixed in 1.11.4. Please let me know if it isn't.



More on query matching [Re: Need Design Documents]

2008-10-27 Thread Micah Cowan

kalpana ravi wrote:
 Hi Everybody,

Hi kalpana,

You sent this message to me and [EMAIL PROTECTED]; you
wanted [EMAIL PROTECTED]

 My name is kalpana Ravi. I am planning to contribute by adding one of the
 features listed in https://savannah.gnu.org/bugs/?22089. For that I need to
 know the design diagrams to understand better. Does anybody know where the
 UML diagrams are?

We don't have UML diagrams for wget: you'll just have to read the
sources (which, unfortunately, are messy). I have some rough-draft
diagrams of how I _want_ wget to look eventually, but I'm not done with
those, and anyway they wouldn't help you with wget now. Even if you had
the UML diagrams for the current state, you'd still need to understand
the sources; I really don't think they'd help you much.

More important than understanding the design is understanding what
needs to be done; we're still getting a grip on that. My current thought
is that there should be a --query-reject (and probably --query-accept,
though the former seems far more useful) that should be matched against
key/value pairs; thus, --query-reject 'foo=bar&action=edit' would reject
anything that has foo=bar and action=edit as key/value pairs in
the query string, even if they're not actually next to each other; an
example rejected URL might be
http://example.com/index.php?a=b&action=edit&token=blah&foo=bar&hergle.

Not all query strings are in the key=value format, so --query-reject
'abc1254' would be allowed, and match against the entire query string.
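
A rough sketch of the proposed matching rule, written in Python for brevity (Wget itself is C). The --query-reject option does not exist yet; the function name is hypothetical, and treating a pattern without "=" as an exact comparison against the whole query string is an assumption about what "match against the entire query string" would mean.

```python
from urllib.parse import urlsplit


def query_matches(url, pattern):
    """Return True if the URL's query string matches the pattern.

    Hypothetical illustration of the proposed --query-reject rule.
    """
    query = urlsplit(url).query
    if "=" not in pattern:
        # Not in key=value form: compare against the entire query string.
        return query == pattern
    # key=value form: every pair in the pattern must appear somewhere in
    # the query string, in any order and not necessarily adjacent.
    present = set(query.split("&"))
    wanted = set(pattern.split("&"))
    return wanted <= present


if __name__ == "__main__":
    url = ("http://example.com/index.php"
           "?a=b&action=edit&token=blah&foo=bar&hergle")
    print(query_matches(url, "foo=bar&action=edit"))
```

With this rule the example URL is rejected by 'foo=bar&action=edit' even though the two pairs are separated by other parameters.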

For an idea of how URL filename matching is currently done, you might check
out acceptable() in src/utils.c and the functions it calls; that should
suggest how query matching might be implemented. However, I'll probably
tackle this bug myself pretty soon if no one else has managed it yet, as
I'm very interested in getting Wget 1.12 finished before long into the
new year (ideally, _before_ the new year, but that probably ain't gonna
happen).

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
GNU Maintainer: wget, screen, teseq
http://micah.cowan.name/


Re: wget re-download fully downloaded files

2008-10-27 Thread Micah Cowan

Maksim Ivanov wrote:
 I'm trying to download the same file from the same server, command line
 I use:
 wget --debug -o log  -c -t 0 --load-cookies=cookie_file
 http://rapidshare.com/files/153131390/Blind-Test.rar
 
 Below attached 2 files: log with 1.9.1 and log with 1.10.2
 Both logs are made when Blind-Test.rar was already on my HDD.
 Sorry for some mess in logs, but russian language used on my console.

This is currently being tracked at https://savannah.gnu.org/bugs/?24662

A similar and related bug report is at
https://savannah.gnu.org/bugs/?24642, in which the logs show that
rapidshare.com also issues erroneous Content-Range information
when it responds with a 206 Partial Content, which exercised a different
regression* introduced in 1.11.x.

* It's not really a regression, since it's desirable behavior: we now
determine the size of the content from the content-range header, since
content-length is often missing or erroneous for partial content.
However, in this instance of server error, it resulted in less-desirable
behavior than the previous version of Wget. Anyway...
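
To illustrate what "determine the size of the content from the content-range header" involves, here is a minimal Python sketch. It is not Wget's parser; the function names are hypothetical, and it only handles the common "bytes first-last/total" form, rejecting headers whose numbers are inconsistent.

```python
import re

# Matches e.g. "bytes 500-999/1000" or "bytes 500-999/*".
_CONTENT_RANGE = re.compile(r"bytes (\d+)-(\d+)/(\d+|\*)")


def parse_content_range(value):
    """Return (first, last, total) or None if the header is unusable.

    total is None when the server reports the full length as "*".
    """
    match = _CONTENT_RANGE.fullmatch(value.strip())
    if match is None:
        return None
    first, last = int(match.group(1)), int(match.group(2))
    total = None if match.group(3) == "*" else int(match.group(3))
    if first > last or (total is not None and last >= total):
        # Inconsistent numbers: the kind of server error discussed above.
        return None
    return first, last, total


def expected_body_size(value):
    """Number of bytes the partial response body should contain."""
    parsed = parse_content_range(value)
    if parsed is None:
        return None
    first, last, _total = parsed
    return last - first + 1


if __name__ == "__main__":
    print(expected_body_size("bytes 500-999/1000"))
```

A client that trusts this header over Content-Length behaves well with correct servers, but has to decide what to do when the sanity checks fail.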

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
GNU Maintainer: wget, screen, teseq
http://micah.cowan.name/


RE: wget re-download fully downloaded files

2008-10-27 Thread Tony Lewis
Micah Cowan wrote:

 Actually, I'll have to confirm this, but I think that current Wget will
 re-download it, but not overwrite the current content, until it arrives
 at some content corresponding to bytes beyond the current content.

 I need to investigate further to see if this change was somehow
 intentional (though I can't imagine what the reasoning would be); if I
 don't find a good reason not to, I'll revert this behavior.

One reason to keep the current behavior is to retain all of the existing
content in the event of another partial download that is shorter than the
previous one. However, I think that only makes sense if wget is comparing
the new content with what is already on disk.
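
A small Python sketch of that idea, purely as an illustration: the function name is hypothetical, the whole body is held in memory for simplicity, and replacing the local file when the contents differ is an assumed policy, not something Wget is known to do.

```python
def merge_redownload(existing, chunks):
    """Return the resulting file contents after a full re-download.

    existing: bytes already on disk.
    chunks:   iterable of bytes objects received from the server.
    """
    received = b"".join(chunks)
    overlap = min(len(existing), len(received))
    if received[:overlap] != existing[:overlap]:
        # Assumption: differing content means the local copy is stale,
        # so the server's version replaces it.
        return received
    if len(received) <= len(existing):
        # The new download is shorter (or equal) and agrees with what we
        # have: keep all of the existing content.
        return existing
    # Same prefix, more data: append only the bytes beyond the local size.
    return existing + received[overlap:]


if __name__ == "__main__":
    print(merge_redownload(b"abc", [b"ab", b"cdef"]))
```

The comparison step is what makes keeping the existing bytes safe; without it, a shorter or different response could leave a mixed file on disk.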

Tony