gzip question
Does wget automatically decompress gzip compressed files? Is there a way to get wget NOT to decompress gzip compressed files, but to download them as the gzipped file? Thanks, Christopher
Re: gzip question
From: Christopher Eastwood Does wget automatically decompress gzip compressed files? I don't think so. Have you any evidence that it does this? (Wget version? OS? Example with transcript?) Is there a way to get wget NOT to decompress gzip compressed files, but to download them as the gzipped file? Just specify the gzip-compressed file, so far as I know. Steven M. Schweda [EMAIL PROTECTED] 382 South Warwick Street (+1) 651-699-9818 Saint Paul MN 55105-2547
RE: gzip question
wget --header='Accept-Encoding: gzip, deflate' http://{gzippedcontent} -Original Message- From: Steven M. Schweda [mailto:[EMAIL PROTECTED] Sent: Wednesday, December 19, 2007 2:57 PM To: WGET@sunsite.dk Cc: Christopher Eastwood Subject: Re: gzip question From: Christopher Eastwood Does wget automatically decompress gzip compressed files? I don't think so. Have you any evidence that it does this? (Wget version? OS? Example with transcript?) Is there a way to get wget NOT to decompress gzip compressed files, but to download them as the gzipped file? Just specify the gzip-compressed file, so far as I know. Steven M. Schweda [EMAIL PROTECTED] 382 South Warwick Street (+1) 651-699-9818 Saint Paul MN 55105-2547
Re: gzip question
From: Christopher Eastwood wget --header='Accept-Encoding: gzip, deflate' http://{gzippedcontent} Doctor, it hurts when I do this. Don't do that. What does it do without --header='Accept-Encoding: gzip, deflate'? [...] (Wget version? OS? Example with transcript?) Still waiting for those data. Also, when I say "example", I normally mean "an actual example", that is, one which can be tested and verified. Adding -d to the wget command can also be informative. SMS.
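For reference, a minimal sketch of the two cases under discussion, using a hypothetical server at www.example.com; whether the body actually arrives gzipped depends on the server honoring Accept-Encoding, and wget of this era does not decode Content-Encoding, so whatever arrives is saved as-is:

    # download a file that is already a .gz archive; wget stores it unchanged
    wget http://www.example.com/archive.tar.gz

    # ask the server to gzip the response in transit; the saved file stays compressed
    wget --header='Accept-Encoding: gzip' -O page.html.gz http://www.example.com/page.html
    gunzip page.html.gz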
Re: Question about spidering
Srinivasan Palaniappan wrote: I am using WGET version 1.10.2, and trying to crawl through a secured site (that we are developing for our customer) and I noticed two things. WGET is not downloading all the binaries in the website. It downloads about 30% of it then skips the rest of the documents. But I don't see any log file that shows me some kind of error message saying it was unable to download during spidering. I am not sure I am doing the right thing; can you let me know from the following .wgetrc file and the command line I run? .wgetrc exclude_directories = /ascp/commerce/catalog,/ascp/commerce/checkout,/ascp/commerce/user,/ascp/commerce/common,/ascp/commerce/javascript,/ascp/commerce/css include_directories = /ascp/commerce,/ascp/commerce/scp/downloads dir_prefix=\spiderfiles\ascpProd\wget domains=www.mysite.com no_parent=on secure-protocol=SSLv3 ^^^ This should use an underscore, not a dash. wget -r l5 --save-headers --no-check-certificate https://www.mystie.com ^^ -r doesn't take an argument. Perhaps you wanted a -l before the 15? In addition, I noticed that the metadata information written to the downloaded file has only HTTP as the scheme, which is somewhat weird; do you know anything about it? I'm not understanding you here. Do you mean that it said, https://...: Unsupported scheme? In that case, I don't see how it could have downloaded 30% of anything, as it means it wasn't compiled with support for SSL and HTTPS. The best way to try to see what might be going on is to invoke wget with the --debug flag, and probably use the -o logfile option. That could help us to see what might be going on. -- Micah J. Cowan Programmer, musician, typesetting enthusiast, gamer... http://micah.cowan.name/
Re: Question about spidering
Micah Cowan wrote: Srinivasan Palaniappan wrote: wget -r l5 --save-headers --no-check-certificate https://www.mystie.com ^^ -r doesn't take an argument. Perhaps you wanted a -l before the 15? Or a - before the l5. Curse the visual ambiguity between l and 1! -- Micah J. Cowan Programmer, musician, typesetting enthusiast, gamer... http://micah.cowan.name/
Question about spidering
Hi, I am using WGET version 1.10.2, and trying to crawl through a secured site (that we are developing for our customer) and I noticed two things. WGET is not downloading all the binaries in the website. It downloads about 30% of it then skips the rest of the documents. But I don't see any log file that shows me some kind of error message saying it was unable to download during spidering. I am not sure I am doing the right thing; can you let me know from the following .wgetrc file and the command line I run? .wgetrc exclude_directories = /ascp/commerce/catalog,/ascp/commerce/checkout,/ascp/commerce/user,/ascp/commerce/common,/ascp/commerce/javascript,/ascp/commerce/css include_directories = /ascp/commerce,/ascp/commerce/scp/downloads dir_prefix=\spiderfiles\ascpProd\wget domains=www.mysite.com no_parent=on secure-protocol=SSLv3 command line --- wget -r l5 --save-headers --no-check-certificate https://www.mystie.com In addition, I noticed that the metadata information written to the downloaded file has only HTTP as the scheme, which is somewhat weird; do you know anything about it? Regards,
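Putting the two corrections from the replies above together, a corrected invocation might look like the following sketch (www.mysite.com stands in for the real customer site, and the depth of 5 is assumed from the intended -l 5):

    # .wgetrc: use an underscore, not a dash
    secure_protocol = SSLv3

    # command line: give -l its own depth argument instead of "l5"
    wget -r -l 5 --save-headers --no-check-certificate https://www.mysite.com/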
Re: Content disposition question
Micah Cowan [EMAIL PROTECTED] writes: Actually, the reason it is not enabled by default is that (1) it is broken in some respects that need addressing, and (2) as it is currently implemented, it involves a significant amount of extra traffic, regardless of whether the remote end actually ends up using Content-Disposition somewhere. I'm curious, why is this the case? I thought the code was refactored to determine the file name after the headers arrive. It certainly looks that way by the output it prints: {mulj}[~]$ wget www.cnn.com [...] HTTP request sent, awaiting response... 200 OK Length: unspecified [text/html] Saving to: `index.html' # "Saving to" appears only after the HTTP response Where does the extra traffic come from? Note that it is not available at all in any release version of Wget; only in the current development versions. We will be releasing Wget 1.11 very shortly, which will include the --content-disposition functionality; however, this functionality is EXPERIMENTAL only. It doesn't quite behave properly, and needs some severe adjustments before it is appropriate to leave as default. If it is not ready for general use, we should consider removing it from NEWS. If not, it should be properly documented in the manual. I am aware that the NEWS entry claims that the feature is experimental, but why even mention it if it's not ready for general consumption? Announcing experimental features in NEWS is a good way to make testers aware of them during the alpha/beta release cycle, but it should be avoided in production releases of mature software. As to breaking old scripts, I'm not really concerned about that (and people who read the NEWS file, as anyone relying on previous behaviors for Wget should do, would just need to set --no-content-disposition, when the time comes that we enable it by default). Agreed.
Re: Content disposition question
Hrvoje Niksic wrote: Micah Cowan [EMAIL PROTECTED] writes: Actually, the reason it is not enabled by default is that (1) it is broken in some respects that need addressing, and (2) as it is currently implemented, it involves a significant amount of extra traffic, regardless of whether the remote end actually ends up using Content-Disposition somewhere. I'm curious, why is this the case? I thought the code was refactored to determine the file name after the headers arrive. It certainly looks that way by the output it prints: {mulj}[~]$ wget www.cnn.com [...] HTTP request sent, awaiting response... 200 OK Length: unspecified [text/html] Saving to: `index.html' # "Saving to" appears only after the HTTP response Where does the extra traffic come from? Your example above doesn't set --content-disposition; if you do, there is an extra HEAD request sent. As to why this is the case, I believe it was so that we could properly handle accepts/rejects, whereas we will otherwise usually assume that we can match accept/reject against the URL itself (we currently do this improperly for the -nd -r case, still matching using the generated file name's suffix). Beyond that, I'm not sure as to why, and it's my intention that it not be done in 1.12. Removing it for 1.11 is too much trouble, as the sending-HEAD and sending-GET are not nearly decoupled enough to do it without risk (and indeed, we were seeing trouble where every time we fixed an issue with the send-head-first issue, something else would break). I want to do some reworking of gethttp and http_loop before I will feel comfortable in changing how they work. If it is not ready for general use, we should consider removing it from NEWS. I had thought of that. The thing that has kept me from it so far is that it is a feature that is desired by many people, and for most of them, it will work (the issues are pretty minor, and mainly corner-case, except perhaps for the fact that they are apparently always downloaded to the top directory, and not the one in which the URL was found). And, if we leave it out of NEWS and documentation, then, when we answer people who ask "How can I get Wget to respect Content-Disposition headers?", the natural follow-up will be, "Why isn't this mentioned anywhere in the documentation?" :) If not, it should be properly documented in the manual. Yes... I should be more specific about its shortcomings. I am aware that the NEWS entry claims that the feature is experimental, but why even mention it if it's not ready for general consumption? Announcing experimental features in NEWS is a good way to make testers aware of them during the alpha/beta release cycle, but it should be avoided in production releases of mature software. It's pretty much good enough; it's not where I want it, but it _is_ usable. The extra traffic is really the main reason I don't want it on-by-default. -- Micah J. Cowan Programmer, musician, typesetting enthusiast, gamer... http://micah.cowan.name/
Re: Content disposition question
Micah Cowan [EMAIL PROTECTED] writes: I thought the code was refactored to determine the file name after the headers arrive. It certainly looks that way by the output it prints: {mulj}[~]$ wget www.cnn.com [...] HTTP request sent, awaiting response... 200 OK Length: unspecified [text/html] Saving to: `index.html' # "Saving to" appears only after the HTTP response Where does the extra traffic come from? Your example above doesn't set --content-disposition; I'm aware of that, but the above example was supposed to point out the refactoring that has already taken place, regardless of whether --content-disposition is specified. As shown above, Wget always waits for the headers before determining the file name. If that is the case, it would appear that no additional traffic is needed to get Content-Disposition; Wget simply needs to use the information already received. As to why this is the case, I believe it was so that we could properly handle accepts/rejects, Issuing another request seems to be the wrong way to go about it, but I haven't thought about it hard enough, so I could be missing a lot of subtleties. I am aware that the NEWS entry claims that the feature is experimental, but why even mention it if it's not ready for general consumption? Announcing experimental features in NEWS is a good way to make testers aware of them during the alpha/beta release cycle, but it should be avoided in production releases of mature software. It's pretty much good enough; it's not where I want it, but it _is_ usable. The extra traffic is really the main reason I don't want it on-by-default. It should IMHO be documented, then. Even if it's documented as experimental.
Content disposition question
Hi! I have noticed that wget doesn't automatically use the option '--content-disposition'. So what happens is when you download something from a site that uses content disposition, the resulting file on the filesystem is not what it should be. For example, when downloading an Ubuntu torrent from mininova I get: {uragan}[~/tmp]$ wget http://www.mininova.org/get/946879 --2007-12-03 15:58:46-- http://www.mininova.org/get/946879 Resolving www.mininova.org... 87.233.147.140 Connecting to www.mininova.org|87.233.147.140|:80... connected. HTTP request sent, awaiting response... 200 OK Length: 28064 (27K) [application/x-bittorrent] Saving to: `946879' 100%[] 28,064 87.0K/s in 0.3s 2007-12-03 15:58:47 (87.0 KB/s) - `946879' saved [28064/28064] When I use the option --content-disposition: {uragan}[~/tmp]$ wget --content-disposition http://www.mininova.org/get/946879 --2007-12-03 15:59:18-- http://www.mininova.org/get/946879 Resolving www.mininova.org... 87.233.147.140 Connecting to www.mininova.org|87.233.147.140|:80... connected. HTTP request sent, awaiting response... 200 OK Length: 0 [application/x-bittorrent] --2007-12-03 15:59:18-- http://www.mininova.org/get/946879 Connecting to www.mininova.org|87.233.147.140|:80... connected. HTTP request sent, awaiting response... 200 OK Length: 28064 (27K) [application/x-bittorrent] Saving to: `-{mininova.org}- ubuntu-7.10-desktop-i386.iso.torrent' 100%[] 28,064 47.8K/s in 0.6s 2007-12-03 15:59:19 (47.8 KB/s) - `-{mininova.org}- ubuntu-7.10-desktop-i386.iso.torrent' saved [28064/28064] I realize that I could put this option in .wgetrc, but I think that it would be better if this was the default, because the majority of users are unaware of this option, and cannot hope to find it unless acquainted with the inner mechanics of HTTP. Also, it's nearly impossible to find. I've been googling it and finally managed to dig it up from the documentation.
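For anyone who does want the .wgetrc route mentioned above, a one-line sketch; the setting name is assumed to match the -e content_disposition=on run-time command shown elsewhere in this archive:

    # in ~/.wgetrc
    content_disposition = on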
Re: Content disposition question
Hi, we know this. This was just recently discussed on the mailing list and I agree with you. But there are two arguments why this is not default: a) It's a quite new feature for wget and therefore would break compatibility with prior versions, and any old script would need to be rewritten. b) It's impossible to pre-guess the filename and thus it is not so well suited for script usage. I would like to have this feature enabled by some --interactive switch (which could include more options and might be easier to find) or, as you suggested, as default with a disable switch. Greetings Matthias Vladimir Niksic wrote: I have noticed that wget doesn't automatically use the option '--content-disposition'. So what happens is when you download something from a site that uses content disposition, the resulting file on the filesystem is not what it should be. I realize that I could put this option in .wgetrc, but I think that it would be better if this was the default, because the majority of users are unaware of this option, and cannot hope to find it unless acquainted with the inner mechanics of HTTP. Also, it's nearly impossible to find. I've been googling it and finally managed to dig it up from the documentation.
Re: Content disposition question
Matthias Vill wrote: Hi, we know this. This was just recently discussed on the mailing list and I agree with you. But there are two arguments why this is not default: a) It's a quite new feature for wget and therefore would break compatibility with prior versions, and any old script would need to be rewritten. b) It's impossible to pre-guess the filename and thus it is not so well suited for script usage. I would like to have this feature enabled by some --interactive switch (which could include more options and might be easier to find) or, as you suggested, as default with a disable switch. Actually, the reason it is not enabled by default is that (1) it is broken in some respects that need addressing, and (2) as it is currently implemented, it involves a significant amount of extra traffic, regardless of whether the remote end actually ends up using Content-Disposition somewhere. Note that it is not available at all in any release version of Wget; only in the current development versions. We will be releasing Wget 1.11 very shortly, which will include the --content-disposition functionality; however, this functionality is EXPERIMENTAL only. It doesn't quite behave properly, and needs some severe adjustments before it is appropriate to leave as default. As to breaking old scripts, I'm not really concerned about that (and people who read the NEWS file, as anyone relying on previous behaviors for Wget should do, would just need to set --no-content-disposition, when the time comes that we enable it by default). -- Micah J. Cowan Programmer, musician, typesetting enthusiast, gamer... http://micah.cowan.name/
Re: Question re server actions
Alan Thomas wrote: I admittedly do not know much about web server responses, and I have a question about why wget did not retrieve a document. . . . I executed the following wget command: wget --recursive --level=20 --append-output=wget_log.txt --accept=pdf,doc,ppt,xls,zip,tar,gz,mov,avi,mpeg,mpg,wmv --no-parent --no-directories --directory-prefix=TEST_AnyLogic_Docs http://www.xjtek.com However, it did not get the PDF document found by clicking on this link: http://www.xjtek.com/anylogic/license_agreement. This URL automatically results in a download of a PDF file. Why? Is there a wget option that will include this file? I believe it's being rejected because it doesn't end in a suffix that's in your --accept list; it's a PDF file, but its URL doesn't end in .pdf. It does use Content-Disposition to specify a filename, but the release version of Wget doesn't acknowledge those. If you use the current development version of Wget, and specify -e content_disposition=on, it will download. If you're willing to try that, you'll need to look at http://wget.addictivecode.org/RepositoryAccess for information on how to get the current development version of Wget (you should use the 1.11 repository, not mainline), and special building requirements. HTH, Micah J. Cowan Programmer, musician, typesetting enthusiast, gamer... http://micah.cowan.name/
Bugs! [Re: Question re server actions]
Alan Thomas wrote: Thanks. I unzipped those binaries, but I still have a problem. . . . I changed the wget command to: wget --recursive --level=20 --append-output=wget_log.txt -econtent_disposition=on --accept=pdf,doc,ppt,xls,zip,tar,gz --no-parent --no-directories --directory-prefix=TEST_AnyLogic_Docs http://www.xjtek.com However, the log file shows: --2007-11-06 21:33:55-- http://www.xjtek.com/ Resolving www.xjtek.com... 207.228.227.14 Connecting to www.xjtek.com|207.228.227.14|:80... connected. HTTP request sent, awaiting response... 200 OK Length: unspecified [text/html] --2007-11-06 21:34:11-- http://www.xjtek.com/ Connecting to www.xjtek.com|207.228.227.14|:80... connected. HTTP request sent, awaiting response... 200 OK Length: unspecified [text/html] Saving to: `TEST_AnyLogic_Docs/index.html' 0K .. 128K=0.08s 2007-11-06 21:34:12 (128 KB/s) - `TEST_AnyLogic_Docs/index.html' saved [11091] Removing TEST_AnyLogic_Docs/index.html since it should be rejected. FINISHED --2007-11-06 21:34:12-- Downloaded: 1 files, 11K in 0.08s (128 KB/s) The version of wget is shown as 1.10+devel. Congratulations! Looks like you've discovered a bug! :\ And just in time, too, as we're expecting to release 1.11 any day now. When I try your version with --debug, it looks like it thinks all the links are trying to escape upwards: that is, it thinks that they disobey your --no-parents. You should be able to remove the --no-parents from your command-line, and it will work, as in your case there _are_ no parents to traverse to, and the --no-parents is superfluous. I also discovered (what I consider to be) a bug, in that wget -e content_disposition=on --accept=pdf -r http://www.xjtek.com/anylogic/license_agreement/ downloads the file to ./License_AnyLogic_6.x.x.pdf, rather than to www.xjtek.com/file/114/License_AnyLogic_6.x.x.pdf (the dirname for which matches its URL after redirection). Also, I'm not sure why - is required vice -- in front of the new option. It's not a long option; it's the short option -e, followed by an argument, content_disposition=on. There is not currently a long-option version for this. Support for Content-Disposition will be enabled by default in Wget 1.12, so a long-option probably won't be added (unless it's to disable the support). -- Micah J. Cowan Programmer, musician, typesetting enthusiast, gamer... http://micah.cowan.name/
Re: wget -o question
From: Micah Cowan But, since any specific transaction is unlikely to take such a long time, the spread of the run is easily deduced by the start and end times, and, in the unlikely event of multiple days, counting time regressions. And if the pages in books were all numbered 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, ..., the reader could easily deduce the actual number for any page, but most folks find it more convenient when all the necessary data are right there in one place. But hey. You're the boss. SMS.
Re: wget -o question
Steven M. Schweda wrote: But, since any specific transaction is unlikely to take such a long time, the spread of the run is easily deduced by the start and end times, and, in the unlikely event of multiple days, counting time regressions. And if the pages in books were all numbered 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, ..., the reader could easily deduce the actual number for any page, but most folks find it more convenient when all the necessary data are right there in one place. To my mind, books are much more likely to cross 10-page boundaries several times, than Wget is to cross more than just one 24-hour boundary. And, there's always date; wget; date... -- Micah J. Cowan Programmer, musician, typesetting enthusiast, gamer... http://micah.cowan.name/
Re: wget -o question
My usage is counter to your assumptions below. I run every hour to connect to 1,000 instruments (1,500 in 12 months) dispersed over the entire western US and Alaska. I append log messages for all runs from a day to a single file. This is an important debugging tool for us. We have mostly VSAT and CDMA connections for remote instruments, but many other variations. Small bandwidth, large latency, and potentially large backlogs of data mean we can run for a couple days catching up with an instrument - rare, but it happens. The current timestamping is a PAIN for us to automatically parse. A change as proposed here is very simple, but would be VERY useful. Right now, we have 116 gigabytes of wget log files. Jim On Sun, 30 Sep 2007, Micah Cowan wrote: Steven M. Schweda wrote: From: Micah Cowan - tms = time_str (NULL); + tms = datetime_str (NULL); Does anyone think there's any general usefulness for this sort of thing? I don't care much, but it seems like a fairly harmless change with some benefit. Of course, I use an OS where a directory listing which shows date and time does so using a consistent and constant format, independent of the age of a file, so I may be biased. :) Though honestly, what this change buys you above simply doing date; wget, I don't know. I think maybe I won't bother, at least for now. Though if I were considering such a change, I'd probably just have wget mention the date at the start of its run, rather than repeat it for each transaction. Obviously wouldn't be a high-priority change... :) That sounds reasonable, except for a job which begins shortly before midnight. I considered this, along with the unlikely 24-hour wget run. But, since any specific transaction is unlikely to take such a long time, the spread of the run is easily deduced by the start and end times, and, in the unlikely event of multiple days, counting time regressions. -- Micah J. Cowan Programmer, musician, typesetting enthusiast, gamer... http://micah.cowan.name/
Re: wget -o question
Jim Wright wrote: My usage is counter to your assumptions below.[...] A change as proposed here is very simple, but would be VERY useful. Okay. Guess I'm sold, then. :D -- Micah J. Cowan Programmer, musician, typesetting enthusiast, gamer... http://micah.cowan.name/
Re: wget -o question
Micah Cowan micah at cowan.name writes: Jim Wright wrote: My usage is counter to your assumptions below.[...] A change as proposed here is very simple, but would be VERY useful. Okay. Guess I'm sold, then. :D -- Micah J. Cowan Programmer, musician, typesetting enthusiast, gamer... http://micah.cowan.name/ Thank you all for your replies. Yes, it is very needed. I use wget on WIN OS. I have a .cmd file that performs wget for several days/weeks/months if needed, so the date information is very useful. Thank you. Saso
Re: wget -o question
From: Micah Cowan - tms = time_str (NULL); + tms = datetime_str (NULL); Does anyone think there's any general usefulness for this sort of thing? I don't care much, but it seems like a fairly harmless change with some benefit. Of course, I use an OS where a directory listing which shows date and time does so using a consistent and constant format, independent of the age of a file, so I may be biased. Though if I were considering such a change, I'd probably just have wget mention the date at the start of its run, rather than repeat it for each transaction. Obviously wouldn't be a high-priority change... :) That sounds reasonable, except for a job which begins shortly before midnight. I'd say that it makes more sense to do it the same way every time. Otherwise, why bother displaying the hour every time, when it changes so seldom? Or the minute? Eleven bytes more per file in the log doesn't seem to me to be a big price to pay for consistent simplicity. Or you could let the victim specify a strftime() format string, and satisfy everyone. Personally, I'd just change time_str() to datetime_str() in a couple of places. Steven M. Schweda [EMAIL PROTECTED] 382 South Warwick Street (+1) 651-699-9818 Saint Paul MN 55105-2547
Re: wget -o question
Steven M. Schweda wrote: From: Micah Cowan - tms = time_str (NULL); + tms = datetime_str (NULL); Does anyone think there's any general usefulness for this sort of thing? I don't care much, but it seems like a fairly harmless change with some benefit. Of course, I use an OS where a directory listing which shows date and time does so using a consistent and constant format, independent of the age of a file, so I may be biased. :) Though honestly, what this change buys you above simply doing date; wget, I don't know. I think maybe I won't bother, at least for now. Though if I were considering such a change, I'd probably just have wget mention the date at the start of its run, rather than repeat it for each transaction. Obviously wouldn't be a high-priority change... :) That sounds reasonable, except for a job which begins shortly before midnight. I considered this, along with the unlikely 24-hour wget run. But, since any specific transaction is unlikely to take such a long time, the spread of the run is easily deduced by the start and end times, and, in the unlikely event of multiple days, counting time regressions. -- Micah J. Cowan Programmer, musician, typesetting enthusiast, gamer... http://micah.cowan.name/
wget -o question
Hi all, I have a question regarding the -o switch: currently I see that the log file contains the timestamp ONLY. Is it possible to tell wget to include the date too? Thank you. Saso
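Until such a change lands, a workaround sketch (the URL and log file name are placeholders; this trades the -o option for a pipe, and assumes a POSIX shell and date command):

    # prepend the current date to each log line as it is written
    wget -nv http://www.example.com/file 2>&1 | while IFS= read -r line; do
        printf '%s %s\n' "$(date '+%Y-%m-%d')" "$line"
    done >> wget-run.log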
Question
Hi All, I am wondering if there is a way that I can download pdf files and organize them in a directory with Wget or should I write a code for that? If I need to write a code for that, would you please let me know if there is any sample code available? Thanks in advance
Re: Question
Andra Isan wrote: I am wondering if there is a way that I can download pdf files and organize them in a directory with Wget or should I write a code for that? If I need to write a code for that, would you please let me know if there is any sample code available? Hello Andra, I don't think your request is very clear. Certainly you can download PDF files with Wget. What do you mean by "organize them in a directory"? What sort of organization do you want? Please be as specific as possible. -- Thanks, Micah J. Cowan Programmer, musician, typesetting enthusiast, gamer... http://micah.cowan.name/
Re: Question
I have a paper proceeding and I want to follow a link of that proceeding and go to a paper link, then follow the paper link and go to an author link, and then follow the author link, which leads to all the papers that the author has written. I want to place all these pdf files (papers of one author) into a directory. So, at the end I have directories of all authors containing papers that those authors have written (one directory for each author). I am not sure if I can do it with Wget or not. Please let me know your idea. Thanks in advance Micah Cowan [EMAIL PROTECTED] wrote: Andra Isan wrote: I am wondering if there is a way that I can download pdf files and organize them in a directory with Wget or should I write a code for that? If I need to write a code for that, would you please let me know if there is any sample code available? Hello Andra, I don't think your request is very clear. Certainly you can download PDF files with Wget. What do you mean by "organize them in a directory"? What sort of organization do you want? Please be as specific as possible. -- Thanks, Micah J. Cowan Programmer, musician, typesetting enthusiast, gamer... http://micah.cowan.name/
Re: Question
It seems to me that you can simply start a recursive, non-parent-traversing fetch (-r -np) of the page with the links, and you'll end up with the PDF files you want (plus anything else linked to on that page). If the PDF files are stored in different directories on the website, they'll be in different directories in the fetch; otherwise, they won't be, and yeah, you'd need to write some script to do what you need (sorry, no samples available). -Micah Andra Isan wrote: I have a paper proceeding and I want to follow a link of that proceeding and go to a paper link, then follow the paper link and go to an author link, and then follow the author link, which leads to all the papers that the author has written. I want to place all these pdf files (papers of one author) into a directory. So, at the end I have directories of all authors containing papers that those authors have written (one directory for each author). I am not sure if I can do it with Wget or not. Micah Cowan [EMAIL PROTECTED] wrote: I don't think your request is very clear. Certainly you can download PDF files with Wget. What do you mean by "organize them in a directory"? What sort of organization do you want? Please be as specific as possible.
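One way to script the per-author organization discussed above, as a rough sketch only (authors.txt and the papers/ layout are hypothetical; it assumes each author has a page whose links include that author's PDFs):

    # authors.txt: one author-page URL per line
    while IFS= read -r url; do
        dir=$(basename "$url")                          # directory named after the author page
        wget -r -l 1 -np -nd -A pdf -P "papers/$dir" "$url"
    done < authors.txt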
Re: Question about the frame
On Jun 26, 2007, at 11:50 PM, Micah Cowan wrote: After running $ wget -H -k -p http://www.fdoxnews.com/ It downloaded all of the relevant files. However, the results were still not viewable until I edited the link in www.fdoxnews.com/index.html, replacing the ? with %3F (index.mas%3Fepl=...). Probably, wget should have done that when converting the links, considering that it named the file with a ?, but left it literally in the converted link; ? is a special character for URIs, and cannot be part of filenames unless they are encoded. I'll make note of that in my buglist. It appears that this is actually by design. If -E (--html-extension) is not specified, `?' will not be replaced with `%3F'. From src/convert.c: We quote ? as %3F to avoid passing part of the file name as the parameter when browsing the converted file through HTTP. However, it is safe to do this only when `--html-extension' is turned on. This is because converting index.html?foo=bar to index.html%3Ffoo=bar would break local browsing, as the latter isn't even recognized as an HTML file! However, converting index.html?foo=bar.html to index.html%3Ffoo=bar.html should be safe for both local and HTTP-served browsing. Running $ wget -E -H -k -p http://www.fdoxnews.com/ does the right thing. -Ben
Re: Question about the frame
Ben Galin wrote: On Jun 26, 2007, at 11:50 PM, Micah Cowan wrote: After running $ wget -H -k -p http://www.fdoxnews.com/ It downloaded all of the relevant files. However, the results were still not viewable until I edited the link in www.fdoxnews.com/index.html, replacing the ? with %3F (index.mas%3Fepl=...). Probably, wget should have done that when converting the links, considering that it named the file with a ?, but left it literally in the converted link; ? is a special character for URIs, and cannot be part of filenames unless they are encoded. I'll make note of that in my buglist. It appears that this is actually by design. If -E (--html-extension) is not specified, `?' will not be replaced with `%3F'. From src/convert.c: We quote ? as %3F to avoid passing part of the file name as the parameter when browsing the converted file through HTTP. However, it is safe to do this only when `--html-extension' is turned on. This is because converting index.html?foo=bar to index.html%3Ffoo=bar would break local browsing, as the latter isn't even recognized as an HTML file! However, converting index.html?foo=bar.html to index.html%3Ffoo=bar.html should be safe for both local and HTTP-served browsing. Running $ wget -E -H -k -p http://www.fdoxnews.com/ does the right thing. Okay, I'll remove that item, then. Thanks very much for looking into that, Ben! -- Micah J. Cowan Programmer, musician, typesetting enthusiast, gamer... http://micah.cowan.name/
Question about the frame
Hi, I am using the following command: wget -p url The url has frames; the url retrieves a page that has a set of frames, but wget doesn't retrieve the html pages of the frame urls. Is there any bug or am I missing something? Also the command wget -r -l 2 url (url has frames): the above command doesn't retrieve the html pages of the urls in the frames. Does wget have any problem with the frames in the html page? I am using version 1.10.2. Note that I am not subscribed to the mailing list, so please include my email in the cc. Thanks! -mish
Re: Question about the frame
Mishari Al-Mishari wrote: Hi, I am using the following command: wget -p url The url has frames; the url retrieves a page that has a set of frames, but wget doesn't retrieve the html pages of the frame urls. Is there any bug or am I missing something? Works fine for me. In fact, if the frames have frames, it'll get those too. How many nested frames have you? Also the command wget -r -l 2 url (url has frames): the above command doesn't retrieve the html pages of the urls in the frames. These two examples strongly suggest that you have a large (>2) number of nested frames. wget will only recurse two levels of page-prerequisites with the -p option. However, if you can't be specific about the URL you're trying, we can't be specific about what's going on. I'd recommend you use the -d (debug) option, redirect the log to a file (-o wget-log), and check the log for a string like: Not descending further (if you're running it in an English locale), which is a good way to tell if it's run into more nested frames than it is willing to pursue. I am using version 1.10.2. Me too. :) -- Micah J. Cowan Programmer, musician, typesetting enthusiast, gamer... http://micah.cowan.name/
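Putting that advice into a concrete sketch (www.example.com/framed.html is a stand-in for the real frameset URL):

    # capture a debug log, then look for the string Micah mentions
    wget -d -o wget-log -p http://www.example.com/framed.html
    grep -n 'Not descending further' wget-log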
RE: Question on wget upload/dload usage
Joe Kopra wrote: The wget statement looks like: wget --post-file=serverdata.mup -o postlog -O survey.html http://www14.software.ibm.com/webapp/set2/mds/mds --post-file does not work the way you want it to; it expects a text file that contains something like this: a=1&b=2 and it sends that raw text to the server in a POST request using a Content-Type of application/x-www-form-urlencoded. If you run it with -d, you will see something like this: POST /someurl HTTP/1.0 User-Agent: Wget/1.10 Accept: */* Host: www.exelana.com Connection: Keep-Alive Content-Type: application/x-www-form-urlencoded Content-Length: 7 ---request end--- [writing POST file data ... done] To post a file as an argument, you need a Content-Type of multipart/form-data, which wget does not currently support. Tony
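As a small illustration of the urlencoded form Tony describes (the field names and URL here are made up for the example):

    # build a urlencoded body by hand, then POST it
    printf 'name=Joe&survey=42' > postdata.txt
    wget --post-file=postdata.txt -O response.html http://www.example.com/cgi-bin/survey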
RE: simple wget question
This is something that is not supported by the http protocol. If you access the site via ftp://..., then you can use wildcards like *.pdf -Original Message- From: R Kimber [mailto:[EMAIL PROTECTED] Sent: Saturday, May 12, 2007 06:43 To: wget@sunsite.dk Subject: Re: simple wget question On Thu, 10 May 2007 16:04:41 -0500 (CDT) Steven M. Schweda wrote: From: R Kimber Yes there's a web page. I usually know what I want. There's a difference between knowing what you want and being able to describe what you want so that it makes sense to someone who does not know what you want. Well I was wondering if wget had a way of allowing me to specify it. But won't a recursive get get more than just those files? Indeed, won't it get everything at that level? The accept/reject options seem to assume you know what's there and can list them to exclude them. I only know what I want. [...] Are you trying to say that you have a list of URLs, and would like to use one wget command for all instead of one wget command per URL? Around here: ALP $ wget -h GNU Wget 1.10.2c, a non-interactive network retriever. Usage: alp$dka0:[utility]wget.exe;13 [OPTION]... [URL]... [...] That [URL]... was supposed to suggest that you can supply more than one URL on the command line. Subject to possible command-line length limitations, this should allow any number of URLs to be specified at once. There's also -i (--input-file=FILE). No bets, but it looks as if you can specify - for FILE, and it'll read the URLs from stdin, so you could pipe them in from anything. Thanks, but my point is I don't know the full URL, just the pattern. What I'm trying to download is what I might express as: http://www.stirling.gov.uk/*.pdf but I guess that's not possible. I just wondered if it was possible for wget to filter out everything except *.pdf - i.e. wget would look at a site, or a directory on a site, and just accept those files that match a pattern. - Richard -- Richard Kimber http://www.psr.keele.ac.uk/
RE: simple wget question
Sorry, I didn't see that Steven has already answered the question. -Original Message- From: Steven M. Schweda [mailto:[EMAIL PROTECTED] Sent: Saturday, May 12, 2007 10:05 To: WGET@sunsite.dk Cc: [EMAIL PROTECTED] Subject: Re: simple wget question From: R Kimber What I'm trying to download is what I might express as: http://www.stirling.gov.uk/*.pdf At last. but I guess that's not possible. In general, it's not. FTP servers often support wildcards. HTTP servers do not. Generally, an HTTP server will not give you a list of all its files the way an FTP server often will, which is why I asked (so long ago) If there's a Web page which has links to all of them, [...]. I just wondered if it was possible for wget to filter out everything except *.pdf - i.e. wget would look at a site, or a directory on a site, and just accept those files that match a pattern. Wget has options for this, as suggested before (wget -h): [...] Recursive accept/reject: -A, --accept=LIST comma-separated list of accepted extensions. -R, --reject=LIST comma-separated list of rejected extensions. [...] but, like many of us, it's not psychic. It needs explicit URLs or else instructions (-r) to follow links which it sees in the pages it sucks down. If you don't have a list of the URLs you want, and you don't have URLs for one or more Web pages which contain links to the items you want, then you're probably out of luck. Steven M. Schweda [EMAIL PROTECTED] 382 South Warwick Street (+1) 651-699-9818 Saint Paul MN 55105-2547
Re: simple wget question
On Thu, 10 May 2007 16:04:41 -0500 (CDT) Steven M. Schweda wrote: From: R Kimber Yes there's a web page. I usually know what I want. There's a difference between knowing what you want and being able to describe what you want so that it makes sense to someone who does not know what you want. Well I was wondering if wget had a way of allowing me to specify it. But won't a recursive get get more than just those files? Indeed, won't it get everything at that level? The accept/reject options seem to assume you know what's there and can list them to exclude them. I only know what I want. [...] Are you trying to say that you have a list of URLs, and would like to use one wget command for all instead of one wget command per URL? Around here: ALP $ wget -h GNU Wget 1.10.2c, a non-interactive network retriever. Usage: alp$dka0:[utility]wget.exe;13 [OPTION]... [URL]... [...] That [URL]... was supposed to suggest that you can supply more than one URL on the command line. Subject to possible command-line length limitations, this should allow any number of URLs to be specified at once. There's also -i (--input-file=FILE). No bets, but it looks as if you can specify - for FILE, and it'll read the URLs from stdin, so you could pipe them in from anything. Thanks, but my point is I don't know the full URL, just the pattern. What I'm trying to download is what I might express as: http://www.stirling.gov.uk/*.pdf but I guess that's not possible. I just wondered if it was possible for wget to filter out everything except *.pdf - i.e. wget would look at a site, or a directory on a site, and just accept those files that match a pattern. - Richard -- Richard Kimber http://www.psr.keele.ac.uk/
Re: simple wget question
From: R Kimber What I'm trying to download is what I might express as: http://www.stirling.gov.uk/*.pdf At last. but I guess that's not possible. In general, it's not. FTP servers often support wildcards. HTTP servers do not. Generally, an HTTP server will not give you a list of all its files the way an FTP server often will, which is why I asked (so long ago) If there's a Web page which has links to all of them, [...]. I just wondered if it was possible for wget to filter out everything except *.pdf - i.e. wget would look at a site, or a directory on a site, and just accept those files that match a pattern. Wget has options for this, as suggested before (wget -h): [...] Recursive accept/reject: -A, --accept=LIST comma-separated list of accepted extensions. -R, --reject=LIST comma-separated list of rejected extensions. [...] but, like many of us, it's not psychic. It needs explicit URLs or else instructions (-r) to follow links which it sees in the pages it sucks down. If you don't have a list of the URLs you want, and you don't have URLs for one or more Web pages which contain links to the items you want, then you're probably out of luck. Steven M. Schweda [EMAIL PROTECTED] 382 South Warwick Street (+1) 651-699-9818 Saint Paul MN 55105-2547
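Combining -r with -A as described above, a sketch for the election-results case discussed in this thread (the depth of 1 assumes the PDFs are linked directly from the starting page; adjust -l and drop -nd/-P as needed):

    # crawl one level from the page that links the PDFs, keeping only *.pdf
    wget -r -l 1 -np -A pdf -nd -P pdfs http://www.stirling.gov.uk/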
Re: simple wget question
On Sun, 6 May 2007 21:44:16 -0500 (CDT) Steven M. Schweda wrote: From: R Kimber If I have a series of files such as http://www.stirling.gov.uk/elections07abcd.pdf http://www.stirling.gov.uk/elections07efg.pdf http://www.stirling.gov.uk/elections07gfead.pdf etc is there a single wget command that would download them all, or would I need to do each one separately? It depends. As usual, it might help to know your wget version and operating system, but in this case, a more immediate mystery would be what you mean by them all, and how one would know which such files exist. GNU Wget 1.10.2, Ubuntu 7.04 If there's a Web page which has links to all of them, then you could use a recursive download starting with that page. Look through the output from wget -h, paying particular attention to the sections Recursive download and Recursive accept/reject. If there's no such Web page, then how would wget be able to divine the existence of these files? Yes there's a web page. I usually know what I want. But won't a recursive get get more than just those files? Indeed, won't it get everything at that level? The accept/reject options seem to assume you know what's there and can list them to exclude them. I only know what I want. Not necessarily what I don't want. I did look at the man page, and came to the tentative conclusion that there wasn't a way (or at least an efficient way) of doing it, which is why I asked the question. - Richard -- Richard Kimber http://www.psr.keele.ac.uk/
simple wget question
If I have a series of files such as http://www.stirling.gov.uk/elections07abcd.pdf http://www.stirling.gov.uk/elections07efg.pdf http://www.stirling.gov.uk/elections07gfead.pdf etc is there a single wget command that would download them all, or would I need to do each one separately? Thanks, - Richard Kimber
Re: simple wget question
From: R Kimber If I have a series of files such as http://www.stirling.gov.uk/elections07abcd.pdf http://www.stirling.gov.uk/elections07efg.pdf http://www.stirling.gov.uk/elections07gfead.pdf etc is there a single wget command that would download them all, or would I need to do each one separately? It depends. As usual, it might help to know your wget version and operating system, but in this case, a more immediate mystery would be what you mean by them all, and how one would know which such files exist. If there's a Web page which has links to all of them, then you could use a recursive download starting with that page. Look through the output from wget -h, paying particular attention to the sections Recursive download and Recursive accept/reject. If there's no such Web page, then how would wget be able to divine the existence of these files? If you're running something older than version 1.10.2, you might try getting the current released version first. Steven M. Schweda [EMAIL PROTECTED] 382 South Warwick Street(+1) 651-699-9818 Saint Paul MN 55105-2547
Re: Question re web link conversions
Steven, I'm not trying to blame wget, but rather understand what is going on and perhaps how to correct it. I am using wget version 1.10.2 and Internet Explorer 6.0.2800.1106 on Windows 98SE. However, when I renamed the file, this problem did not occur. So, I think it was something to do with the characters in the filename, which you mentioned. Thanks, Alan - Original Message - From: Steven M. Schweda [EMAIL PROTECTED] To: WGET@sunsite.dk Cc: [EMAIL PROTECTED] Sent: Tuesday, March 13, 2007 1:23 AM Subject: Re: Question re web link conversions From: Alan Thomas As usual, wget without a version does not adequately describe the wget program you're using, Internet Explorer without a version does not adequately describe the Web browser you're using, and I can only assume that you're doing all this on some version or other of Windows. It might help to know which of everything you're using. (But it might not.) Using GNU Wget 1.10.2c built on VMS Alpha V7.3-2 (wget -V), I had no such trouble with either a Mozilla or an old Netscape 3 browser. (I did need to rename the resulting file to something with fewer exotic characters before I could get either browser to admit that the file existed, but it's hard to see how that could matter much.) It's not obvious to me how any browser could invent a URL to which to go Back, so my first guess is operator error, but it's even less obvious to me how anything wget could do could cause this behavior, either. You might try it with Firefox or any browser with no history which might confuse a Back button. If there's a way to blame wget for this, I'll be amazed. (That has happened before, however.) Steven M. Schweda [EMAIL PROTECTED] 382 South Warwick Street (+1) 651-699-9818 Saint Paul MN 55105-2547
Question re web link conversions
I am using the wget command below to get a page from the U.S. Patent Office. This works fine. However, when I open the resulting local file with Internet Explorer (IE), click a link in the file (go to another web site) and then click Back, it goes back to the real web address (http:...) vice the local file (c:\program files\wget\patents\ . . .). Does this have something to do with how wget converts web links? Is there something I should do differently with wget? I'm not clear on why it would do this. When I save this site directly from IE as an HTML file, it works fine. (When I click Back, it goes back to the local file.) Thanks, Alan wget --convert-links --directory-prefix=C:\Program Files\wget\patents --no-clobber http://appft1.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&u=/netahtml/PTO/search-adv.html&r=0&p=1&f=S&l=50&Query=ttl/software&d=PG01
Re: Question re web link conversions
From: Alan Thomas As usual, wget without a version does not adequately describe the wget program you're using, Internet Explorer without a version does not adequately describe the Web browser you're using, and I can only assume that you're doing all this on some version or other of Windows. It might help to know which of everything you're using. (But it might not.) Using GNU Wget 1.10.2c built on VMS Alpha V7.3-2 (wget -V), I had no such trouble with either a Mozilla or an old Netscape 3 browser. (I did need to rename the resulting file to something with fewer exotic characters before I could get either browser to admit that the file existed, but it's hard to see how that could matter much.) It's not obvious to me how any browser could invent a URL to which to go Back, so my first guess is operator error, but it's even less obvious to me how anything wget could do could cause this behavior, either. You might try it with Firefox or any browser with no history which might confuse a Back button. If there's a way to blame wget for this, I'll be amazed. (That has happened before, however.) Steven M. Schweda [EMAIL PROTECTED] 382 South Warwick Street(+1) 651-699-9818 Saint Paul MN 55105-2547
RE: Newbie Question - DNS Failure
I installed wget on a HP-UX box using the depot package. Which depot package? (Anyone can make a depot package.) Depot package came from http://hpux.connect.org.uk/hppd/hpux/Gnu/wget-1.10.2/ Which wget version (wget -V)? 1.10.2 Built how? Installed using swinstall Running on which HP-UX system type? RP-5405 OS version? HP-UX B.11.11 Resolving www.lambton.on.ca... failed: host nor service provided, or not known. First guess: You have a DNS problem, not a wget problem. Can any other program on the system (Web browser, nslookup, ...) resolve names any better? Nslookup and ping work wonderfully. Sorry, I should have mentioned that the first time. Second guess: If DNS works for everyone else, I'd try building wget (preferably a current version, 1.10.2) from the source, and see if that makes any difference. (Who knows what name resolver is linked in with the program in the depot?) Started to try that and got some error messages during the build. I may need to re-investigate. Third guess: Try the ITRC forum for HP-UX, but you'll probably need more info than this there, too: http://forum1.itrc.hp.com/service/forums/familyhome.do?familyId=117 Thanks, I'll check. Steven M. Schweda [EMAIL PROTECTED] 382 South Warwick Street(+1) 651-699-9818 Saint Paul MN 55105-2547
Re: Newbie Question - DNS Failure
From: Terry Babbey Built how? Installed using swinstall How the depot contents were built probably matters more. Second guess: If DNS works for everyone else, I'd try building wget (preferably a current version, 1.10.2) from the source, and see if that makes any difference. [...] Started to try that and got some error messages during the build. I may need to re-investigate. As usual, it might help if you showed what you did, and what happened when you did it. Data like which compiler (and version) could also be useful. On an HP-UX 11.23 Itanium system, starting with my VMS-compatible kit (http://antinode.org/dec/sw/wget.html, which shouldn't matter much here), I seemed to have no problems building using the HP C compiler, other than getting a bunch of warnings related to socket stuff, which seem to be harmless. (Built using CC=cc ./configure and make.) td176 cc -V cc: HP C/aC++ B3910B A.06.13 [Nov 27 2006] And I see no obvious name resolution problems: td176 ./wget http://www.lambton.on.ca --23:42:04-- http://www.lambton.on.ca/ = `index.html' Resolving www.lambton.on.ca... 192.139.190.140 Connecting to www.lambton.on.ca|192.139.190.140|:80... failed: Connection refused. td176 ./wget -V GNU Wget 1.10.2c built on hpux11.23. [...] That's on an HP TestDrive system, which is behind a restrictive firewall, which, I assume, explains the connection problem. (At least it got an IP address for the name.) And it's not the same OS version, and who knows which patches have been applied to either system?, and so on. Steven M. Schweda [EMAIL PROTECTED] 382 South Warwick Street (+1) 651-699-9818 Saint Paul MN 55105-2547
Re: Newbie Question - DNS Failure
From: Terry Babbey I installed wget on a HP-UX box using the depot package. Great. Which depot package? (Anyone can make a depot package.) Which wget version (wget -V)? Built how? Running on which HP-UX system type? OS version? Resolving www.lambton.on.ca... failed: host nor service provided, or not known. First guess: You have a DNS problem, not a wget problem. Can any other program on the system (Web browser, nslookup, ...) resolve names any better? Second guess: If DNS works for everyone else, I'd try building wget (preferably a current version, 1.10.2) from the source, and see if that makes any difference. (Who knows what name resolver is linked in with the program in the depot?) Third guess: Try the ITRC forum for HP-UX, but you'll probably need more info than this there, too: http://forums1.itrc.hp.com/service/forums/familyhome.do?familyId=117 Steven M. Schweda [EMAIL PROTECTED] 382 South Warwick Street(+1) 651-699-9818 Saint Paul MN 55105-2547
Newbie Question - DNS Failure
I installed wget on an HP-UX box using the depot package. Now when I run wget it will not resolve DNS queries. wget http://192.139.190.140/ works. wget http://www.lambton.on.ca fails with the following error:

# wget http://www.lambton.on.ca
--17:21:22-- http://www.lambton.on.ca/ => `index.html'
Resolving www.lambton.on.ca... failed: host nor service provided, or not known.

Any help is appreciated. Thanks, Terry

Terry Babbey - Technical Support Specialist Information Educational Technology Department Lambton College, Sarnia, Ontario, CANADA
Re: Question!
At 2006-11-07 02:57, Yan Qing Chen wrote: Hi wget, I found a problem when I try to mirror an FTP site using wget. I use it with the -m and -b parameters. Some files are copied again on every mirror run. How should I configure a mirror site? Thanks, Best Regards

Hi, when the modification date reported by the server is newer than the timestamp of your local file, wget will retrieve the page again (this is correct: modified files have to be retrieved again); maybe this is the cause of your problem. If the web server reports a new modification date on every load of the page, even when nothing has changed, that is a server misconfiguration, or perhaps an intentional configuration, or badly written dynamic pages (ASP, PHP, etc.) that don't take care of the issue. HTH, Andrea
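To see what modification date a server is actually reporting (and whether it changes between otherwise identical requests), wget can print the response headers without saving anything. A minimal sketch with a placeholder URL; running it twice and comparing the Last-Modified lines should show whether the date really changes on every load. For FTP, -S likewise prints the server's responses, and the timestamps wget compares come from the remote directory listing.

wget -S --spider http://www.example.com/some/page.html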
Question!
Hi wget, I found a problem when I try to mirror an FTP site using wget. I use it with the -m and -b parameters. Some files are copied again on every mirror run. How should I configure a mirror site? Thanks, Best Regards, Yan Qing Chen (陈延庆) Tivoli China Development (IBM CSDL) Internet Email: [EMAIL PROTECTED] Address: Haohai Building 3F, ShangDi 5th Street, HaiDian District, BEIJING, 100085, CHINA
Err, which list? [was: Re: wget question (connect multiple times)]
Tony Lewis wrote: A) This is the list for reporting bugs. Questions should go to wget@sunsite.dk Err, I posted Qs to wget@sunsite.dk and they come via this list - is there a mix-up here? Perhaps why I never get any answers;) (If there's anyone else listening to this list and holding back on giving me some fantastic bit of info that'll make my life forever better because this is a bug list and not a question list - please feel free to email me off-list:) M. -- Morgan Read NEW ZEALAND mailto:mstuffATreadDOTorgDOTnz fedora: Freedom Forever! http://fedoraproject.org/wiki/Overview By choosing not to ship any proprietary or binary drivers, Fedora does differ from other distributions. ... Quote: Max Spevik http://interviews.slashdot.org/article.pl?sid=06/08/17/177220
RE: Err, which list? [was: Re: wget question (connect multiple times)]
If you are not on the distribution list, you can read the archive at http://www.mail-archive.com/wget@sunsite.dk/
-Original Message- From: Morgan Read [mailto:[EMAIL PROTECTED] Sent: Wednesday, October 18, 2006 4:23 PM To: Tony Lewis Cc: wget@sunsite.dk Subject: Err, which list? [was: Re: wget question (connect multiple times)]
Re: wget question (connect multiple times)
Tony Lewis [EMAIL PROTECTED] writes: A) This is the list for reporting bugs. Questions should go to wget@sunsite.dk For what it's worth, [EMAIL PROTECTED] is simply redirected to [EMAIL PROTECTED] It is still useful to have a separate address for bug reports, for at least two reasons. One, the mailing list could theoretically move to another location; and two, at some point we might decide to stop redirecting bug reports to the public mailing list. Neither of these is likely to happen any time soon, as far as I know, though.
wget question (connect multiple times)
hi, I hope it is okay to drop a question here. I recently found that if wget downloads one file, my download speed will be Y, but if wget downloads two separate files (from the same server, doesn't matter), the download speed for each of the files will be Y (so my network speed will go up to 2 x Y). So my question is, can I make wget download the same file multiple times simultaneously? In a way, it would run as multiple processes and download parts of the file at the same time, speeding up the download. Hope I could explain my question, sorry about the bad english. Thanks PS. Please consider this as an enhancement request if wget cannot get a file by downloading parts of it simultaneously.
RE: wget question (connect multiple times)
A) This is the list for reporting bugs. Questions should go to wget@sunsite.dk
B) wget does not support downloading the same file multiple times simultaneously.
C) The decreased per-file download time you're seeing is (probably) because wget is reusing its connection to the server to download the second file. It takes some time to set up a connection to the server regardless of whether you're downloading one byte or one gigabyte of data. For small files, the setup time can be a significant part of the overall download time.
Hope that helps! Tony
-Original Message- From: t u [mailto:[EMAIL PROTECTED] Sent: Tuesday, October 17, 2006 3:50 PM To: [EMAIL PROTECTED] Subject: wget question (connect multiple times)
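One practical way to benefit from that connection reuse is to hand several URLs to a single wget invocation, so later files can ride over the already-open connection where the server permits persistent connections. A minimal sketch; the host and file names are placeholders:

wget http://www.example.com/file1.zip http://www.example.com/file2.zip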
Re: wget question (connect multiple times)
Tony Lewis wrote:
A) This is the list for reporting bugs. Questions should go to wget@sunsite.dk
B) wget does not support downloading the same file multiple times simultaneously.
C) The decreased per-file download time you're seeing is (probably) because wget is reusing its connection to the server to download the second file. It takes some time to set up a connection to the server regardless of whether you're downloading one byte or one gigabyte of data. For small files, the setup time can be a significant part of the overall download time.
Hope that helps! Tony

Just to make sure this is received as a feature request because of (B): it would be nice to enable wget to download parts of files simultaneously as multiple processes. Example: wget --option-to-multiple-download=3 file.ext, where wget1 downloads the first 1/3 of the file, wget2 downloads the second 1/3 of the file, and wget3 downloads the third 1/3 of the file. Thanks for the reply, it helped. Sincerely.

PS. As a response to (A), I sent my message to bug-wget because I wanted this to be considered as a feature request if it wasn't already implemented. Also, I did not see wget@sunsite.dk at http://www.gnu.org/software/wget/index.html#mailinglists. It only lists bug-wget and wget-patches.
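Until something like that exists, the only parallelism available is across different files, which is what produced the 2 x Y observation above. A rough shell sketch (placeholder URLs) that simply runs two downloads in the background and waits for both; note that this does not split a single file into pieces, each wget still fetches its file whole:

wget http://www.example.com/file1.iso &
wget http://www.example.com/file2.iso &
wait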
RE: wget question (connect multiple times)
On Tue, 17 Oct 2006, Tony Lewis wrote: A) This is the list for reporting bugs. Questions should go to wget@sunsite.dk I had always understood that bug-wget was just an alias for the regular wget mailing list. Has this changed recently? Doug -- Doug Kaufman Internet: [EMAIL PROTECTED]
Question / Suggestion for wget
If -O output file and -N are both specified, it seems like there should be some mode where the tests for noclobber apply to the output file, not the filename that exists on the remote machine. So, if I run

# wget -N http://www.gnu.org/graphics/gnu-head-banner.png -O foo

and then

# wget -N http://www.gnu.org/graphics/gnu-head-banner.png -O foo

the second wget would not clobber and re-get the file. Similarly, it seems odd that

# wget http://www.gnu.org/graphics/gnu-head-banner.png

and then

# wget -N http://www.gnu.org/graphics/gnu-head-banner.png -O foo

refuses to write the file named foo. I realize there are already lots of options and the interactions can be pretty confusing, but I think what I'm asking for would be of general usefulness. Maybe I'm sadistic, but -NO amuses me as a way to turn on this behavior. Perhaps just --no-clobber-output-document would be saner. Thanks for your consideration, Mitch
Re: Question / Suggestion for wget
From: Mitch Silverstein If -O output file and -N are both specified [...] When -O foo is specified, it's not a suggestion for a file name to be used later if needed. Instead, wget opens the output file (foo) before it does anything else. Thus, it's always a newly created file, and hence tends to be newer than any file existing on any server (whose date-time is set correctly). -O has its uses, but it makes no sense to combine it with -N. Remember, too, that wget allows more than one URL to be specified on a command line, so multiple URLs may be associated with a single -O output file. What sense does -N make then? It might make some sense to create some positional option which would allow a URL-specific output file, like, say, -OO, to be used so: wget http://a.b.c/d.e -OO not_dd.e http://g.h.i/j.k -OO not_j.k but I don't know if the existing command-line parser could handle that. Alternatively, some other notation could be adopted, like, say, file=URL, to be used so: wget not_dd.e=http://a.b.c/d.e not_j.k=http://g.h.i/j.k But that's not what -O does, and that's why you're (or your expectations are) doomed. Steven M. Schweda [EMAIL PROTECTED] 382 South Warwick Street (+1) 651-699-9818 Saint Paul MN 55105-2547
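A workaround that keeps -N meaningful is to let wget store the file under its server-side name, so the timestamp comparison has something to compare against on later runs, and then copy it to the name actually wanted. A sketch, reusing the example URL and the name foo from the original message; the downloaded gnu-head-banner.png has to stay around, since that is the file -N checks next time:

wget -N http://www.gnu.org/graphics/gnu-head-banner.png
cp -p gnu-head-banner.png foo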
question with wget 1.10.2 for windows
Thx for the program, first off. This might be a big help for me. What I'm trying to do is pull .aspx pages off of a company's website as .html files and save them locally. I also need the images and CSS to be converted for local use. I can't figure out the proper command to do this. Also, it's starting at a file within a subfolder within the site, for example: http://www.example.com/press/release.aspx Thanks for any help. Ken Savage, Web Developer Tel. 978-947-2888 Kronos Incorporated 297 Billerica Road Chelmsford, MA 01824 Experts at Improving the Performance of People and Business www.kronos.com
RE: question with wget 1.10.2 for windows
Try wget -r -np http://www.example.com/press/release.aspx and then write a script to change all the .aspx files' extensions to .html. (You can't get the server-side code, BTW, only the HTML that is generated.) Ranjit Sandhu SRA
From: Savage, Ken [mailto:[EMAIL PROTECTED] Sent: Thursday, August 17, 2006 3:46 PM To: [EMAIL PROTECTED] Subject: question with wget 1.10.2 for windows
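If the wget in use is recent enough (the -E/--html-extension option appeared around version 1.7, so 1.10.2 should have it), the renaming script may not even be needed. A sketch along these lines, untested against that particular site, saves server-generated text/html pages with an .html suffix, pulls in page requisites such as images and CSS, and rewrites the links to work locally:

wget -r -np -p -k -E http://www.example.com/press/release.aspx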
Suggestion/Question
Hallo, yesterday I came across wget and I find it a very useful program. I am mirroring a big site, more precisely a forum. Because it is a forum, under each post you have the action quote. Because that forum has 20,000 posts, wget would download them all with action=quote, so I rejected those with R=*action=quote*. It works as documented in the manual: the files aren't stored, but they are downloaded anyway and deleted right after downloading. Why can't wget skip these files resp. URLs? That would make downloading much faster, and the site admin would also be happy because he has less traffic. If the reason is that wget has to fetch these files to make sure it doesn't miss any links, then a switch would be useful to turn this behaviour off manually, for users who know they don't need those documents or anything linked below them. In a forum, for example, it is clear that these files don't need to be analysed, because they won't link to any further documents. Thanks for your answer. Markus
Re: wget output question
I do get the full Internet address in the download if I use -k or --convert-links, but not if I use it with -O

Ah. Right you are. Looks like a bug to me. Wget/1.10.2a1 (VMS Alpha V7.3-2) says this without -O:

08:53:42 (51.00 MB/s) - `index.html' saved [2674]
Converting index.html... 0-14
Converted 1 files in 0.232 seconds.

and this with -O:

08:54:06 (297.15 KB/s) - `test.html' saved [2674]
test.html: file currently locked by another user [Sounds VMS-specific, yes?]
Converting test.html... nothing to do.
Converted 1 files in 0.039 seconds.

The message from Wget 1.9.1a was less informative:

08:57:13 (297.11 KB/s) - `test.html' saved [2674]
: no such file or directory
Converting ... nothing to do.
Converted 1 files in 0.00 seconds.

Without looking at the code, I'd say that someone is calling the conversion code before closing the -O output file. As a user could specify multiple URLs with a single -O output file, it may be difficult to make this work in the same way it would without -O, so a normal download followed by a quick rename (mv) might be your best hope, at least in the short term.

Steven M. Schweda (+1) 651-699-9818 382 South Warwick Street [EMAIL PROTECTED] Saint Paul MN 55105-2547
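The rename workaround would look roughly like this (a sketch, assuming the page is saved as index.html, which is wget's default name for a URL ending in a slash):

wget -k http://www.google.com/
mv index.html test.html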
Re: wget output question
Steven M. Schweda wrote: I do get the full Internet address in the download if I use -k or --convert-links, but not if I use it with -O Ah. Right you are. Looks like a bug to me. Is the developer available to confirm this? Without looking at the code, I'd say that someone is calling the conversion code before closing the -O output file. As a user could specify multiple URLs with a single -O output file, it may be difficult to make this work in the same way it would without -O, so a normal download followed by a quick rename (mv) might be your best hope, at least in the short term. Yeah, that's the only thing I've been able to come up with as well. FYI, I tried piping the output to another wget statement and also redirecting to a file, but I pretty much ended up with the same result. If anyone has another suggestion, I'm all ears. :) Jon
wget output question
I'm trying to use wget to do the following: 1. retrieve a single page 2. convert the links in the retrieved page to their full, absolute addresses. 3. save the page with a file name that I specify I thought this would do it: wget -k -O test.html http://www.google.com However, it doesn't convert the links - it just saves the file as test.html. What's the correct syntax to use? Thanks, Jon wget version 1.9
Re: wget output question
1. retrieve a single page That worked. 2. convert the links in the retrieved page to their full, absolute addresses. My wget -h output (Wget 1.10.2a1) says: -k, --convert-links make links in downloaded HTML point to local files. Wget 1.9.1e says: -k, --convert-links convert non-relative links to relative. Not anything about converting relative links to absolute. I don't see an option to do this automatically. 3. save the page with a file name that I specify That worked. That's two out of three. Why would you want this result? Steven M. Schweda (+1) 651-699-9818 382 South Warwick Street[EMAIL PROTECTED] Saint Paul MN 55105-2547
Re: wget output question
Steven M. Schweda wrote: Not anything about converting relative links to absolute. I don't see an option to do this automatically. From the wget man page for --convert-links: ...if a linked file was downloaded, the link will refer to its local name; if it was not downloaded, the link will refer to its full Internet address rather than presenting a broken link... I do get the full Internet address in the download if I use -k or --convert-links, but not if I use it with -O 3. save the page with a file name that I specify Why would you want this result? It's complicated, but the original file name is this long ass URL that contains multiple parameters which I don't need. I just need a simple filename like test.html. I can probably write a script to rename the files, but I'm trying to understand why wget won't allow me to do this. Jon
Wget Log Question
I am trying to use Wget to get all the web pages of the IP Phones. If I use the default verbose log option, the log gives me too much unneeded information: wget -t 1 -i phones_104.txt -O test.txt -o log.txt If I add the -nv option, the log file looks fine:

20:14:23 URL:http://10.104.110.10/NetworkConfiguration [6458] -> output_104.txt [1]
20:14:23 URL:http://10.104.110.11/NetworkConfiguration [6458] -> output_104.txt [1]
..

But it only gives me the logs for the registered phones' web pages, which means the web pages could be opened. It does not give me the logs for the unregistered phones' web pages, whose web pages could not be opened. So the following logs are lost in the non-verbose case:

--23:21:40-- http://10.104.104.8/NetworkConfiguration => `test.txt'
Connecting to 10.104.104.8:80... failed: Connection timed out. Giving up.

Those unregistered phone logs are very useful to me. Which options could give me the non-verbose logs for those failed connections? Thanks Dennis
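One possible workaround, rather than a wget option, is to keep the full verbose log and filter it afterwards. The grep pattern below is only a guess at the messages of interest (connection failures and HTTP errors) and may need adjusting to the exact wording in the log:

wget -t 1 -i phones_104.txt -O test.txt -o log.txt
grep -E 'failed|ERROR' log.txt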
Re: A mirror question
At 10:18 on Thursday, 1 September 2005, Pär-Ola Nilsson wrote: Hi! Is it possible to get wget to delete files that have disappeared at the remote ftp-host during --mirror? not at the moment, but we might consider adding it to 2.0. -- Aequam memento rebus in arduis servare mentem... Mauro Tortonesi http://www.tortonesi.com University of Ferrara - Dept. of Eng. http://www.ing.unife.it GNU Wget - HTTP/FTP file retrieval tool http://www.gnu.org/software/wget Deep Space 6 - IPv6 for Linux http://www.deepspace6.net Ferrara Linux User Group http://www.ferrara.linux.it
wget Mailing List question
Would it be possible (and is anyone else interested) to have the subject line of messages posted to this list prefixed with '[wget]'? I belong to several development mailing lists that utilize this feature so that distributed messages do not get removed by spam filters, or deleted by recipients because they have no idea who sent the message. Often the subject line does not indicate that the message relates to wget. For example, I almost deleted a message with the subject Honor --datadir because it looked like spam. If the subject line read [wget] Honor --datadir it would be much easier to deal with. Is anyone else interested in this idea? Is it feasible? Jonathan
Re: wget Mailing List question
On Fri, 26 Aug 2005, Jonathan wrote: Would it be possible (and is anyone else interested) to have the subject line of messages posted to this list prefixed with '[wget]'? Please don't. Subject real estate is precious and limited already as it is. I find subject prefixes highly disturbing. There is already plenty of info in the headers of each single mail to allow them to get filtered accurately without this being needed. For example: X-Mailing-List: wget@sunsite.dk -- -=- Daniel Stenberg -=- http://daniel.haxx.se -=- ech`echo xiun|tr nu oc|sed 'sx\([sx]\)\([xoi]\)xo un\2\1 is xg'`ol
Re: wget Mailing List question
Jonathan [EMAIL PROTECTED] writes: Would it be possible (and is anyone else interested) to have the subject line of messages posted to this list prefixed with '[wget]'? I am against munging subject lines of mail messages. The mailing list software provides headers such as `Mailing-List' and `X-Mailing-List' which can be used for better and more reliable filtering.
Re: Question
Mauro Tortonesi [EMAIL PROTECTED] writes: On Saturday 09 July 2005 10:34 am, Abdurrahman ÇARKACIOĞLU wrote: MS Internet Explorer can save a web page as a whole. That means all the images, tables, etc. can be saved as a single file. It is called a Web Archive, single file (*.mht). Is this possible with wget? not at the moment, but it's a planned feature for wget 2.0. Really? I've never heard of a .mht web archive, it seems a Windows-only thing.
Re: Question
While the MHT format is not extremely popular yet, I'm betting it will continue to grow in popularity. It encapsulates an entire web page and its graphics, javascripts, style sheets, etc. into a single text file. This makes it much easier to email and store. See RFC 2557 for more info: http://www.faqs.org/rfcs/rfc2557.html It is currently supported by Netscape and Mozilla Thunderbird. Frank

Hrvoje Niksic wrote: Mauro Tortonesi [EMAIL PROTECTED] writes: On Saturday 09 July 2005 10:34 am, Abdurrahman ÇARKACIOĞLU wrote: MS Internet Explorer can save a web page as a whole. That means all the images, tables, etc. can be saved as a single file. It is called a Web Archive, single file (*.mht). Is this possible with wget? not at the moment, but it's a planned feature for wget 2.0. Really? I've never heard of a .mht web archive, it seems a Windows-only thing.

-- Frank McCown Old Dominion University http://www.cs.odu.edu/~fmccown
Re: Question
On Tuesday 09 August 2005 04:37 am, Hrvoje Niksic wrote: Mauro Tortonesi [EMAIL PROTECTED] writes: On Saturday 09 July 2005 10:34 am, Abdurrahman ÇARKACIOĞLU wrote: MS Internet Explorer can save a web page as a whole. That means all the images, tables, etc. can be saved as a single file. It is called a Web Archive, single file (*.mht). Is this possible with wget? not at the moment, but it's a planned feature for wget 2.0. Really? I've never heard of a .mht web archive, it seems a Windows-only thing.

oops, my fault. i was in a hurry and i misunderstood what Abdurrahman was asking. what i wanted to say is that we talked about supporting the same html file download mode as firefox, in which you save all the related files in a directory with the same name as the document you downloaded. i think that would be nice. sorry for the misunderstanding.

-- Aequam memento rebus in arduis servare mentem... Mauro Tortonesi http://www.tortonesi.com University of Ferrara - Dept. of Eng. http://www.ing.unife.it Institute for Human Machine Cognition http://www.ihmc.us GNU Wget - HTTP/FTP file retrieval tool http://www.gnu.org/software/wget Deep Space 6 - IPv6 for Linux http://www.deepspace6.net Ferrara Linux User Group http://www.ferrara.linux.it
Re: Question
Mauro Tortonesi [EMAIL PROTECTED] writes: oops, my fault. i was in a hurry and i misunderstood what Abdurrahman was asking. what i wanted to say is that we talked about supporting the same html file download mode of firefox, in which you save all the related files in a directory with the same name of the document you donwloaded. i think that would be nice. sorry for the misunderstanding. No problem. Once wget -r/-p is taught to parse links on the fly instead of expecting to find them in fixed on-disk locations, writing to MHT should be easy. It seems to be a MIME-like format that builds on the existing concept of multipart/related messages. Instead of converting links to local files, we'd convert them to identifiers (free-form strings) defined with content-id.
Re: Question
On Saturday 09 July 2005 10:34 am, Abdurrahman ÇARKACIOĞLU wrote: MS Internet Explorer can save a web page as a whole. That means all the images, tables, etc. can be saved as a single file. It is called a Web Archive, single file (*.mht). Is this possible with wget? not at the moment, but it's a planned feature for wget 2.0. -- Aequam memento rebus in arduis servare mentem... Mauro Tortonesi http://www.tortonesi.com University of Ferrara - Dept. of Eng. http://www.ing.unife.it Institute for Human Machine Cognition http://www.ihmc.us GNU Wget - HTTP/FTP file retrieval tool http://www.gnu.org/software/wget Deep Space 6 - IPv6 for Linux http://www.deepspace6.net Ferrara Linux User Group http://www.ferrara.linux.it
Question
MS Internet Explorer can save a web page as a whole. That means all the images, tables, etc. can be saved as a single file. It is called a Web Archive, single file (*.mht). Is this possible with wget?
wget Question/Suggestion
Is there an option, or could you add one if there isn't, to specify that I want wget to write the downloaded html file, or whatever, to stdout so I can pipe it into some filters in a script?
Re: wget Question/Suggestion
Mark Anderson [EMAIL PROTECTED] writes: Is there an option, or could you add one if there isn't, to specify that I want wget to write the downloaded html file, or whatever, to stdout so I can pipe it into some filters in a script? Yes, use `-O -'.
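For example, something along these lines (the URL and the filter are placeholders) sends the document itself to the pipe, with -q keeping progress messages out of it:

wget -q -O - http://www.example.com/page.html | grep -i keyword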
question
I use wget 1.9.1. In IE 6.0 the page loads OK, but wget returns the following (is it a bug, a timeout, or ...?):

16:59:59 (9.17 KB/s) - Read error at byte 31472 (Operation timed out). Retrying.
--16:59:59-- http://www.nirgos.com/d.htm (try: 2) => `/p5/poisk/spider/resource/www.nirgos.com/d.htm'
Connecting to www.nirgos.com[217.16.25.57]:80... connected.
HTTP request sent, awaiting response... 416 Requested Range Not Satisfiable
17:00:02 ERROR 416: Requested Range Not Satisfiable.
Re: question
[EMAIL PROTECTED] writes: I use wget 1.9.1 In IE6.0 page load OK, but wget return (It's a bug or timeout or ...?) Thanks for the report. The reported timeout might or might not be incorrect. Wget 1.9.1 on Windows has a known bug of misrepresenting error codes (this has been fixed in 1.10, which is now in beta). The error reported by IE is presumably caused by Wget requesting an impossible range, but it is hard to be sure without access to the debug output. If you can repeat this problem, please mail us the output with the `-d' option.
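For instance, a run along these lines would capture the full debug trace in a file that can be mailed to the list (when -o is given, the -d output goes into that log file):

wget -d -o wget-debug.log http://www.nirgos.com/d.htm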
Re: newbie question
Hi Alan! As the URL starts with https, it is a secure server. You will need to log in to this server in order to download stuff. See the manual for info how to do that (I have no experience with it). Good luck Jens (just another user) I am having trouble getting the files I want using a wildcard specifier (-A option = accept list). The following command works fine to get an individual file: wget https://164.224.25.30/FY06.nsf/($reload)/85256F8A00606A1585256F900040A32F/$FILE/160RDTEN_FY06PB.pdf However, I cannot get all PDF files with this command: wget -A *.pdf https://164.224.25.30/FY06.nsf/($reload)/85256F8A00606A1585256F900040A32F/$FILE/ Instead, I get: Connecting to 164.224.25.30:443 . . . connected. HTTP request sent, awaiting response . . . 400 Bad Request 15:57:52 ERROR 400: Bad Request. I also tried this command without success: wget https://164.224.25.30/FY06.nsf/($reload)/85256F8A00606A1585256F900040A32F/$FILE/*.pdf Instead, I get: HTTP request sent, awaiting response . . . 404 Bad Request 15:57:52 ERROR 404: Bad Request. I read through the manual but am still having trouble. What am I doing wrong? Thanks, Alan
RE: newbie question
Alan Thomas wrote: I am having trouble getting the files I want using a wildcard specifier... There are no options on the command line for what you're attempting to do. Neither wget nor the server you're contacting understand *.pdf in a URI. In the case of wget, it is designed to read web pages (HTML files) and then collect a list of resources that are referenced in those pages, which it then retrieves. In the case of the web server, it is designed to return individual objects on request (X.pdf or Y.pdf, but not *.pdf). Some web servers will return a list of files if you specify a directory, but you already tried that in your first use case. Try coming at this from a different direction. If you were going to manually download every PDF from that directory, how would YOU figure out the names of each one? Is there a web page that contains a list somewhere? If so, point wget there. Hope that helps. Tony PS) Jens was mistaken when he said that https requires you to log into the server. Some servers may require authentication before returning information over a secure (https) channel, but that is not a given.
Re: newbie question
Alan Thomas [EMAIL PROTECTED] writes: I am having trouble getting the files I want using a wildcard specifier (-A option = accept list). The following command works fine to get an individual file: wget https://164.224.25.30/FY06.nsf/($reload)/85256F8A00606A1585256F900040A32F/$FILE/160RDTEN_FY06PB.pdf However, I cannot get all PDF files this command: wget -A *.pdf https://164.224.25.30/FY06.nsf/($reload)/85256F8A00606A1585256F900040A32F/$FILE/ Instead, I get: Connecting to 164.224.25.30:443 . . . connected. HTTP request sent, awaiting response . . . 400 Bad Request 15:57:52 ERROR 400: Bad Request. Does that URL work with a browser? What version of Wget are you using? Using -d will provide a full log of what Wget is doing, as well as the responses it is getting. You can mail the log here, but please be sure it doesn't contain sensitive information (if applicable). This list is public and has public archives. Please note that you also need -r (or even better -r -l1) for -A to work the way you want it.
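Putting those pieces together, the usual pattern is to point wget at a page that links to the PDFs and let -r -l1 with -A do the selection. A sketch only; the index-page URL below is a placeholder for whatever page on that server actually lists the documents:

wget -r -l1 -np -A pdf https://164.224.25.30/FY06.nsf/some-index-page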
Re: newbie question
Tony Lewis [EMAIL PROTECTED] writes: PS) Jens was mistaken when he said that https requires you to log into the server. Some servers may require authentication before returning information over a secure (https) channel, but that is not a given. That is true. HTTPS provides encrypted communication between the client and the server, but it doesn't always imply authentication.
Re: newbie question
Hi! Yes, I see now, I misread Alan's original post. I thought he would not even be able to download the single .pdf. Don't know why, as he clearly said it works getting a single pdf. Sorry for the confusion! Jens Tony Lewis [EMAIL PROTECTED] writes: PS) Jens was mistaken when he said that https requires you to log into the server. Some servers may require authentication before returning information over a secure (https) channel, but that is not a given. That is true. HTTPS provides encrypted communication between the client and the server, but it doesn't always imply authentication.
Re: [unclassified] Re: newbie question
I got the wgetgui program, and used it successfully. The commands were very much like this one. Thanks, Alan - Original Message - From: Technology Freak [EMAIL PROTECTED] To: Alan Thomas [EMAIL PROTECTED] Sent: Thursday, April 14, 2005 10:12 AM Subject: [unclassified] Re: newbie question Alan, You could try something like this wget -r -d -l1 -H -t1 -nd -N -np -A pdf URL On Wed, 13 Apr 2005, Alan Thomas wrote: Date: Wed, 13 Apr 2005 16:02:40 -0400 From: Alan Thomas [EMAIL PROTECTED] To: wget@sunsite.dk Subject: newbie question I am having trouble getting the files I want using a wildcard specifier (-A option = accept list). The following command works fine to get an individual file: wget https://164.224.25.30/FY06.nsf/($reload)/85256F8A00606A1585256F900040A32F/$FILE/160RDTEN_FY06PB.pdf However, I cannot get all PDF files this command: wget -A *.pdf https://164.224.25.30/FY06.nsf/($reload)/85256F8A00606A1585256F900040A32F/$FILE/ --- TekPhreak [EMAIL PROTECTED] http://www.tekphreak.com
--continue question
I am using wget to retrieve files from a somewhat unstable FTP server. Often I kill and restart wget with the --continue option. I use Perl to manage the progress of wget, and on bad days wget may be restarted 40, 50 or 60 times before the complete file is retrieved. The problem is that sometimes the file ends up containing a section of garbage. I suspect that data is not being flushed when I kill wget, and/or wget continues on from the wrong spot. Does this behavior seem possible? Can you recommend a fix or workaround? I am using wget 1.9.1 and running on Windows XP. Thanks --ken
wget: question about tag
Hi, I have a question about wget. Is it possible to download attribute values other than the hardcoded ones? For example, I have the following HTML code:

...
<applet name="RosaApplet" archive="./rosa/rosa.jar" code="Rosa2000" width="400" height="300" MAYSCRIPT>
<param name="TB_POSITION" value="right">
<param name="TB_ALIGN" value="top">
<param name="IMG_URL" value="/ms_tmp/1107184599245591.gif">
<param name="INP_FORM_NAME" value="main">
...

I want to retrieve the image under <param name="IMG_URL">. Thanks. Norm
RE: wget: question about tag
Normand Savard wrote: I have a question about wget. Is it possible to download attribute values other than the hardcoded ones? No, at least not in the existing versions of wget. I have not heard that anyone is working on such an enhancement.
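A two-step workaround is possible from the shell: fetch the page, pull the attribute value out with a text tool, and then fetch that URL. A rough sketch, with a placeholder host name and page URL, which assumes the attributes are double-quoted as shown above:

wget -q -O page.html http://www.example.com/map-page.html
# extract the IMG_URL param value from the saved page
img=`sed -n 's/.*name="IMG_URL" value="\([^"]*\)".*/\1/p' page.html`
wget "http://www.example.com$img"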
wget operational question
Probably insane question but - is there a way with wget to download the output (as text) and NOT the HTML code? I have a site I want and they are BOLDING the first few letters - and I just want the name without the html tags. So a straight text output would suffice. thanks e.g. with explorer - I bring up the documentation for wget and then save as text - it then saves a text file of the html - but no formatting no html code etc etc

GNU Wget Manual
GNU Wget The noninteractive downloading utility Updated for Wget 1.8.1, December 2001 by Hrvoje [EMAIL PROTECTED]'{c} and the developers Table of Contents Overview Invoking URL Format Option Syntax Basic Startup Options
Re: wget operational question
% wget -q -O - http://www.gnu.org/software/wget/manual/wget-1.8.1/html_mono/wget.html | html2text | head -15 ** GNU Wget ** * The noninteractive downloading utility * * Updated for Wget 1.8.1, December 2001 * by Hrvoje [EMAIL PROTECTED]'{c} and the developers === ** Table of Contents ** * Overview * Invoking o URL Format o Option Syntax o Basic Startup Options o Logging and Input File Options o Download Options o Directory Options o HTTP Options Of course, sounds like you are using windows; no idea if any of this will work there. Jim
RE: wget operational question
Thanks Jim. Yes but the command line version. I was thinking a few steps ahead I suppose. The browser apps have an htm2txt type converter built in - technically. I was sort of thinking wget should as well. Was reviewing the docs of this plus cURL (am I allowed to utter that 4 letter word here?) FYI to all: (Windows tools - such as the one described) http://users.erols.com/waynesof/bruce.htm Thanks Jim - you brought me safely down to the lowest level of where I need to be. Cheers! Jeff (If we discuss GUI then yes the number of options is limited but in the command line world there is typically plenty across all platforms)
-Original Message- From: Jim Wright [mailto:[EMAIL PROTECTED] Sent: Friday, October 01, 2004 07:23 PM To: Jeff Holicky Cc: [EMAIL PROTECTED] Subject: Re: wget operational question
question on wget via http proxy
Hello, I am sitting behind an HTTP proxy and need to access the internet through this channel. In most cases this works fine - but there are certain FTP server sites that I can only access via browser or wget. This also is no problem - as long as I only need to retrieve data. Problems come up as soon as I need to upload data - this seems to be possible only via Netscape 4. All tools that I used (including gftp, kbear, lftp) do not help out. E.g. using gftp I can access ftp.suse.com - but not these sites. As the browser is rather unreliable in this respect I would like to use another tool. Problem sites are testcase.boulder.ibm.com ftp.software.ibm.com Since wget is able to obtain directory listings / retrieve data from there, it should be possible to also upload data (the browser is able to as well). What is so special about wget that it is able to perform this task? If I knew, maybe I could find a solution to this problem. I am running Linux SuSE 9.0, kernel 2.4.26, wget-1.8.2-301. I have set the env variables http_proxy and ftp_proxy, which make the connection work fine with wget. Any idea? Thank you Malte
Re: question on wget via http proxy
Malte Schünemann wrote: Since wget is able to obtain directory listings / retrieve data from there, it should be possible to also upload data Then it would be wput. :-) What is so special about wget that it is able to perform this task? You can learn a LOT about how wget is communicating with the target site by using the --debug argument. Hope that helps a little. Tony
question about wget use or (possible) new feature
Hello, I have tried to use wget to download forum pages. But the point is that wget downloads all links such as site.com/forum?topic=5&way_to_show=1stway, site.com/forum?topic=5&way_to_show=2ndway and so on... The point is that all these links have the same contents, just different ways of showing it. Is there any way to control the "parameters" in a link? To filter them, etc., something like --reject for filenames? Thank you. Best regards, Olga Lav http://www.dogsempire.com
Question: How do I get wget to get past a form based authentication?
I am trying to use wget to spider our company web site to be able to save copies of the site periodically. We moved from web based authentication to form based last year and I can't figure out how to get wget to get past the authentication. Most of our content is behind the authentication. If wget won't work for this, any other suggestions? Thanks. Imelda Bettinger IT eTechnology AMVESCAP PLC * 713.214.4669 Direct * [EMAIL PROTECTED]
Re: Question: How do I get wget to get past a form based authentication?
Bettinger, Imelda [EMAIL PROTECTED] writes: We moved from web based authentication to form based last year and I can't figure out how to get wget to get past the authenication. Most of our content is behind the authentication. By form based authentication I assume you mean that you enter your credentials in a web form, after which the browser session is authenticated? In that case, the authentication information is really carried by the cookie. To get Wget to send it, specify `--load-cookies' on the cookie file exported by the browser, such as Mozilla's cookies.txt. This is explained in the manual under the `--load-cookies' option.
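A typical invocation then looks something like this (the cookie file is whatever the browser exported, and the URL is a placeholder for the site behind the form login):

wget --load-cookies cookies.txt -r -np http://intranet.example.com/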