RE: .1, .2 before suffix rather than after
Hmm - changing the rename scheme would potentially create a HUGE issue with clobbering. For example (quite hypothetical), given a directory with the following:

    index.html
    index-1.html
    index.1.html

All three are served by the server and rendered by the browser. They are distinct files, both to the file system and in the web server's URL interpretation of the file system. Now, Wget downloads index.html, then downloads it again. Our choices for the second file are:

  1) index.html.1
  2) index-1.html
  3) index.1.html

Of the three, only #1 is pretty much guaranteed *not* to exist on the web server. Why? Because by changing the extension, we've changed the content type. So if our intention is not to clobber (which, I believe, is the whole point), we are *much* better off sticking with the current scheme and creating a file that most likely can't be served by the web server.

Note that this is quite a contrived example to illustrate the point.

However, my 2 cents on the behavior: it would be *wonderful* if Wget could look at the local file system and rename each existing version to file.ext.n+1, so that the new download is index.html, not index.html.1. I've been caught a couple of times by this, so to me the default behavior is backwards (i.e., the new file should be named after the URL, and older files should get versioned).

Chris

Christopher G. Lewis
http://www.ChristopherLewis.com

-----Original Message-----
From: Hrvoje Niksic [mailto:[EMAIL PROTECTED]]
Sent: Sunday, November 04, 2007 4:19 PM
To: Wget
Cc: Christian Roche
Subject: Re: .1, .2 before suffix rather than after

Hrvoje Niksic [EMAIL PROTECTED] writes:

> Micah Cowan [EMAIL PROTECTED] writes:
>
>> Christian Roche has submitted a revised version of a patch to modify
>> the unique-name-finding algorithm to generate names in the pattern
>> foo-n.html rather than foo.html.n. The patch looks good, and will
>> likely go in very soon.
>
> foo.html.n has the advantage of simplicity: you can tell at a glance
> that foo.n is a duplicate of foo. Also, it is trivial to remove the
> unwanted files by removing foo.*.

It just occurred to me that this change breaks backward compatibility. It will break scripts that try to clean up after Wget or that in any way depend on the current naming scheme.
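Hrvoje's cleanup point is easy to demonstrate in a shell session. A minimal sketch (the directory and file names are made up for illustration):

```shell
# A scratch directory standing in for a download area.
demo=$(mktemp -d) && cd "$demo"
touch index.html index.html.1 index.html.2   # original + two Wget duplicates

# Under the current scheme, duplicates match a glob anchored on the full
# original name -- index.html itself is NOT matched:
ls index.html.*      # index.html.1  index.html.2

# So cleaning up after Wget is a one-liner:
rm -f index.html.[0-9]*
ls                   # only index.html remains
```

Under the proposed foo-n.html scheme, no such glob exists: an index-1.html created by Wget would be indistinguishable from a file that genuinely had that name on the server.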
Re: .1, .2 before suffix rather than after
Christopher G. Lewis wrote:

> Hmm - changing the rename scheme would potentially create a HUGE issue
> with clobbering. For example (quite hypothetical), given a directory
> with the following: index.html, index-1.html, index.1.html. All three
> are served by the server and rendered by the browser. They are
> distinct files, both to the file system and in the web server's URL
> interpretation of the file system. Now, Wget downloads index.html,
> then downloads it again. Our choices for the second file are:
> 1) index.html.1, 2) index-1.html, 3) index.1.html. Of the three, only
> #1 is pretty much guaranteed *not* to exist on the web server. Why?
> Because by changing the extension, we've changed the content type. So
> if our intention is not to clobber (which, I believe, is the whole
> point), we are *much* better off sticking with the current scheme and
> creating a file that most likely can't be served by the web server.

Of course you are 100% correct that it is the whole point. However, while this is indeed a problem, I don't think it's a clobbering problem: I believe Wget would then choose (or could be made to choose) index-2.html, etc., for the file which on the server is named index-1.html. While that would resolve the clobbering, it would make it virtually impossible to determine which file got which local name, which is entirely unacceptable.

I wonder how Wget currently handles perverse cases like index.html.1 actually existing on the server and already on the local system. :)

> Note that this is quite a contrived example to illustrate the point.

Yeah. Unfortunately, though, something like page-1.html, page-2.html isn't quite so unlikely.

It's intended that Reget (I'll call it that for now, until we figure out what the hell we're going to do with that whole cluster of functionality) will have support for a database of download-session metadata that would handle mappings between the remote URI and the local file. With that, it'd be possible to construct a simple utility which could be invoked like "reget-fmap http://example.com/foo.html" and might spit out something like "./example.com/foo.html". This might couple quite well with providing a plugin hook to control the renaming scheme.

Given your excellent points, and the fact that I didn't get the overwhelmingly positive response to this suggestion that I had anticipated, I'd better table this patch. :(

> However, my 2 cents on the behavior - it would be *wonderful* if Wget
> could look at the local file system and rename each existing version
> to file.ext.n+1, so that the new download is index.html, not
> index.html.1. I've been caught a couple of times by this, so to me
> the default behavior is backwards (i.e., the new file should be named
> after the URL, and older files should get versioned).

That would of course be substantially more work, and provide even greater opportunity for race conditions and interoperability issues than we already have, but I agree that it'd be nice to have. Unfortunately, I don't think there's any way we'll ever do this in Wget: it'd be too confusing for people used to the current behavior. And while, as Hrvoje pointed out, the currently proposed suffixes patch could potentially break backwards compatibility, it's not likely to do so in a harmful or destructive way; whereas any scripts that currently download files and then erase the renamed ones would suddenly be destroying the new data, rather than the old, if we reversed the renaming. :\

That problem is partly exacerbated by the fact that, from a certain perspective, we ought to be able to stick our noses in the air and claim that any scripts of that sort ought to have been telling Wget to clobber files, rather than letting Wget rename them and trying to delete them afterwards... but there is currently no way to ask Wget to clobber files when it normally wouldn't. However, with the proper hook in Reget, it'd be easy enough to have a plugin that handles it this way.

Actually, since Reget is looking to probably be an entirely new beast, and we'll certainly have to break compatibility with traditional Wget, we could consider making this the default renaming mechanism for Reget; but I'm still concerned about the extra work, the race conditions, and the potential for interfering with other programs that may be operating on some of the files involved.

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer...
http://micah.cowan.name/
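For what it's worth, the "newest file keeps the plain name" behavior Chris asked for can be prototyped outside of Wget. A sketch under the assumption that the rotation runs just before each download; rotate_versions is a hypothetical helper, not an existing Wget feature:

```shell
# Shift index.html -> index.html.1, index.html.1 -> index.html.2, and so on,
# so a fresh download can take the plain name. Highest-numbered version is
# moved first, so nothing is ever clobbered.
rotate_versions () {
    base=$1
    # Find the highest existing ".N" suffix for this base name.
    n=0
    while [ -e "$base.$((n + 1))" ]; do n=$((n + 1)); done
    # Walk downward, bumping each version up by one.
    while [ "$n" -gt 0 ]; do
        mv "$base.$n" "$base.$((n + 1))"
        n=$((n - 1))
    done
    # Finally move the unversioned file to .1, freeing the plain name.
    if [ -e "$base" ]; then mv "$base" "$base.1"; fi
}

# Usage: rotate first, then let wget write the fresh copy:
#   rotate_versions index.html
#   wget http://example.com/index.html
```

Rotating from the highest number down is what avoids the self-clobbering Micah worries about, though it does nothing about races with other programs touching the same files.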
Re: .1, .2 before suffix rather than after
Hrvoje Niksic wrote:

> It just occurred to me that this change breaks backward compatibility.
> It will break scripts that try to clean up after Wget or that in any
> way depend on the current naming scheme.

I'm also a bit hesitant about changing the way files get named. With a .1 at the absolute end of the filename I _know_ this file got its name because there already was a file with the same name. If the new file is instead named filename-1.jpg, I cannot be certain whether this is because of a file collision, or because the original file really had this name, which of course it might have had.

If a script is supposed to restore the original filename of a downloaded file (perhaps for future downloads), it's easy to just cut the trailing number, if there is one. How could that be done in an easy and reliable way if there is a possible number before the extension, a number that I don't even know is part of the original filename or not?

And already having local files named -1.ext is not so uncommon. What happens if there is a local file with that name? -2.ext could be the answer, but that makes it really difficult to find downloaded files programmatically.

And how is .tar.gz renamed? .tar-1.gz?

Sorry, but I'm not so sure about this..

--
Andreas
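Andreas's "just cut the trailing number" recovery really is mechanical under the current scheme. A sketch in plain POSIX shell (strip_dup_suffix is a made-up name for illustration):

```shell
# Strip a Wget-style trailing ".N" duplicate suffix, if present.
# "photo.jpg.3" -> "photo.jpg"; names without a numeric tail are untouched.
strip_dup_suffix () {
    name=$1
    tail=${name##*.}                            # text after the last dot
    case $tail in
        ''|*[!0-9]*) printf '%s\n' "$name" ;;   # not a pure number: keep as-is
        *)           printf '%s\n' "${name%.*}" ;;  # pure number: cut it off
    esac
}

strip_dup_suffix photo.jpg.3      # -> photo.jpg
strip_dup_suffix photo.jpg        # -> photo.jpg
strip_dup_suffix archive.tar.gz   # -> archive.tar.gz ("gz" is not a number)
```

With file-1.jpg naming there is no comparable rule: "-1" may or may not be part of the original name, which is exactly the ambiguity Andreas describes.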
Delete a partial download after a timeout
Hello,

I use wget to retrieve an XML feed. The problem is that sometimes I get a timeout error. I used the appropriate wget options to handle this problem, but when wget retries the download, the new data is appended to the file, which corrupts it. Here is the wget command line:

    wget -c -nc -O tmp/out.xml -t 5 -w 60 -T 60 http://url.of.the.feed/feed.xml

Can you help me fix this problem? I am not subscribed to the list, so please CC me on the reply.

Thanks a lot,
Sydney
Re: Delete a partial download after a timeout
moisi wrote:

> Hello, I use wget to retrieve an XML feed. The problem is that
> sometimes I get a timeout error. I used the appropriate wget options
> to handle this problem, but when wget retries the download, the new
> data is appended to the file, which corrupts it. Here is the wget
> command line:
>
>     wget -c -nc -O tmp/out.xml -t 5 -w 60 -T 60 http://url.of.the.feed/feed.xml

Hm. I'm not sure how Wget could deal with this in a general way. We could certainly ask Wget to truncate the output file when it refers to a truncate-able file; but it doesn't always (such as with -O -). I think you'd be better off not using -O, and renaming/appending the file after downloading.

I believe your use of -c and -nc with -O is not meaningful; -O is meant to work very similarly to a redirection. Hm... but current development Wget seems to have a regression with -nc and -O relative to Wget 1.10.2:

    $ ls -a
    .  ..
    $ wget -c -nc -O foo micah.cowan.name
    --2007-11-06 10:55:28--  http://micah.cowan.name/
    Resolving micah.cowan.name... 66.150.225.51
    Connecting to micah.cowan.name|66.150.225.51|:80... connected.
    HTTP request sent, awaiting response... 200 OK
    File 'foo' already there; not retrieving.
    $

Question to the group: should -nc even work with -O at all? And if so, I suppose the Wget 1.10.2 behavior would be the expected one...

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer...
http://micah.cowan.name/
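Micah's advice (drop -O and rename after downloading) can be wrapped so that a timed-out transfer never touches the previous copy. A sketch, not an existing Wget feature; fetch_feed is a hypothetical helper name, and the retry/timeout options are taken from the original command line:

```shell
# Download URL to DEST atomically: a failed or timed-out transfer never
# corrupts the existing file, and no partial data is ever appended.
fetch_feed () {
    url=$1
    dest=$2
    tmp=$(mktemp "$dest.XXXXXX") || return 1
    # Note: no -c and no -nc here; the temp file starts empty every time.
    if wget -q -t 5 -w 60 -T 60 -O "$tmp" "$url"; then
        mv "$tmp" "$dest"       # success: replace the old copy in one step
    else
        rm -f "$tmp"            # failure: throw the partial file away
        return 1
    fi
}

# Usage:
#   fetch_feed http://url.of.the.feed/feed.xml tmp/out.xml
```

The mv at the end is the key design point: the previous out.xml stays valid right up until a complete replacement exists.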
Re: .1, .2 before suffix rather than after
Andreas Pettersson [EMAIL PROTECTED] writes:

> And how is .tar.gz renamed? .tar-1.gz?

Ouch.
Re: Question re server actions
Alan Thomas wrote:

> I admittedly do not know much about web server responses, and I have a
> question about why wget did not retrieve a document. I executed the
> following wget command:
>
>     wget --recursive --level=20 --append-output=wget_log.txt
>          --accept=pdf,doc,ppt,xls,zip,tar,gz,mov,avi,mpeg,mpg,wmv
>          --no-parent --no-directories
>          --directory-prefix=TEST_AnyLogic_Docs http://www.xjtek.com
>
> However, it did not get the PDF document found by clicking on this
> link: http://www.xjtek.com/anylogic/license_agreement. This URL
> automatically results in a download of a PDF file. Why? Is there a
> wget option that will include this file?

I believe it's being rejected because its URL doesn't end in a suffix that's in your --accept list: it's a PDF file, but its URL doesn't end in .pdf. It does use Content-Disposition to specify a filename, but the release version of Wget doesn't honor that header.

If you use the current development version of Wget and specify -e content_disposition=on, it will download. If you're willing to try that, you'll need to look at http://wget.addictivecode.org/RepositoryAccess for information on how to get the current development version of Wget (you should use the 1.11 repository, not mainline), and the special building requirements.

HTH,

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer...
http://micah.cowan.name/
Bugs! [Re: Question re server actions]
Alan Thomas wrote:

> Thanks. I unzipped those binaries, but I still have a problem. I
> changed the wget command to:
>
>     wget --recursive --level=20 --append-output=wget_log.txt
>          -e content_disposition=on --accept=pdf,doc,ppt,xls,zip,tar,gz
>          --no-parent --no-directories
>          --directory-prefix=TEST_AnyLogic_Docs http://www.xjtek.com
>
> However, the log file shows:
>
>     --2007-11-06 21:33:55--  http://www.xjtek.com/
>     Resolving www.xjtek.com... 207.228.227.14
>     Connecting to www.xjtek.com|207.228.227.14|:80... connected.
>     HTTP request sent, awaiting response... 200 OK
>     Length: unspecified [text/html]
>     --2007-11-06 21:34:11--  http://www.xjtek.com/
>     Connecting to www.xjtek.com|207.228.227.14|:80... connected.
>     HTTP request sent, awaiting response... 200 OK
>     Length: unspecified [text/html]
>     Saving to: `TEST_AnyLogic_Docs/index.html'
>
>      0K ..........                              128K=0.08s
>
>     2007-11-06 21:34:12 (128 KB/s) - `TEST_AnyLogic_Docs/index.html' saved [11091]
>
>     Removing TEST_AnyLogic_Docs/index.html since it should be rejected.
>
>     FINISHED --2007-11-06 21:34:12--
>     Downloaded: 1 files, 11K in 0.08s (128 KB/s)
>
> The version of wget is shown as 1.10+devel.

Congratulations! Looks like you've discovered a bug! :\ And just in time, too, as we're expecting to release 1.11 any day now.

When I try your version with --debug, it looks like it thinks all the links are trying to escape upwards: that is, it thinks they disobey your --no-parent. You should be able to remove --no-parent from your command line and it will work; in your case there _are_ no parents to traverse to, so the option is superfluous.

I also discovered (what I consider to be) a bug, in that

    wget -e content_disposition=on --accept=pdf -r http://www.xjtek.com/anylogic/license_agreement/

downloads the file to ./License_AnyLogic_6.x.x.pdf, rather than to www.xjtek.com/file/114/License_AnyLogic_6.x.x.pdf (the directory name for which matches its URL after redirection).

> Also, I'm not sure why - is required vice -- in front of the new
> option.

It's not a long option; it's the short option -e, followed by an argument, content_disposition=on. There is currently no long-option version of this. Support for Content-Disposition will be enabled by default in Wget 1.12, so a long option probably won't be added (unless it's to disable the support).

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer...
http://micah.cowan.name/
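As a side note, since -e executes a wgetrc-style command, the same setting can (if I recall the wgetrc syntax correctly) be placed in a configuration file instead of on the command line:

```
# ~/.wgetrc -- equivalent to passing -e content_disposition=on
# (development Wget 1.11+ only; the release version ignores this setting)
content_disposition = on
```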