Re: OpenVMS URL
Then your problem isn't with wget. Once you figure out how to access the file in a web browser, use the same URL in wget.

Tony

- Original Message -
From: Bufford, Benjamin (AGRE) [EMAIL PROTECTED]
To: Tony Lewis [EMAIL PROTECTED]; [EMAIL PROTECTED]
Sent: Wednesday, May 26, 2004 8:41 AM
Subject: RE: OpenVMS URL

That's the problem I'm having. With all the looking and reading I've done, I haven't found a way to specify the type of pathname I used as an example (disk:[directory.subdirectory]filename) as a URL for a browser or anything else that requires a URL to retrieve things over FTP.

-Original Message-
From: Tony Lewis [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, May 26, 2004 11:08 AM
To: Bufford, Benjamin (AGRE); [EMAIL PROTECTED]
Subject: Re: OpenVMS URL

How do you enter the path in your web browser?

- Original Message -
From: Bufford, Benjamin (AGRE) [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Wednesday, May 26, 2004 7:32 AM
Subject: OpenVMS URL

I am trying to use wget to retrieve a file from an OpenVMS server, but have been unable to make wget process a path with a volume name in it. For example: disk:[directory.subdirectory]filename. How would I go about entering this type of path in a way that wget can understand?
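One untested possibility for the original question: brackets and colons are not legal in a raw URL path, so a VMS file specification can at least be expressed by percent-encoding them ("vmshost" below is a placeholder for the real server). Whether the VMS FTP server then interprets the decoded path is entirely up to its implementation:

wget "ftp://vmshost/disk%3A%5Bdirectory.subdirectory%5Dfilename"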
Re: trouble with encoded filename
[EMAIL PROTECTED] wrote: Well, I found out a little bit more about the real reason for the problem. Opera has a very convenient option called "Encode International Web Addresses with UTF-8". When I had this option checked, it could retrieve the file without problems. Without this option enabled, I get the same "forbidden" response that I received when I used wget. In my never-humble opinion, wget needs this ability also. I had hoped that using the option --restrict-file-names=nocontrol would have disabled encoding of the URL, but apparently, it does not.

Huh? Opera is doing special encoding for some types of web addresses and you hoped that disabling ALL encoding would somehow make wget do the same thing? If special encoding is required: 1) someone has to write the code in wget to perform that encoding, and 2) it has to be ENabled (not DISabled).

Tony
Re: Problem Accessing FTP Site Where Password Contains @
[EMAIL PROTECTED] wrote: I came across a problem accessing an FTP site where the password contained an @ sign. The password was [EMAIL PROTECTED] So I tried the following:

wget -np --server-response -H --tries=1 -c --wait=60 --retry-connrefused -R "*" ftp://guest:[EMAIL PROTECTED]@83.21.191.254:21/document.rar

Try ftp://guest:1nDi:[EMAIL PROTECTED]:21/document.rar

In other words: username COLON password AT-SIGN address
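The underlying rule: any @ inside the password itself must be percent-encoded as %40 so that only one unencoded @ remains to separate the credentials from the host. With a made-up password (the real one is redacted above), that looks like:

wget 'ftp://guest:p%40ssword@83.21.191.254:21/document.rar'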
Re: not downloading at all, help
Juhana Sadeharju wrote: I placed use_proxy = off in .wgetrc (which file I did not have earlier) and in ~/wget/etc/wgetrc (which file I had), and tried wget --proxy=off http://www.maqamworld.com and it still does not work. Could there be some system wgetrc files somewhere? I have compiled wget on my own in my home directory, and certainly wish that my own installation does not use files from some other installation. Why did you think the :80 comes from a proxy? I have always thought it comes from the target site, not from our site. Did you try the given command yourself, and did it work? Please try now if you did not. If wget puts in the :80, then how do I instruct wget not to do that no matter what is told somewhere? What part of the source code should I edit if that is the only thing that helps? Though, you should fix this in the wget source because something is not working now. I wonder why this non-working behaviour is the default in wget...

In the communications world where two computers are talking to one another, there is no such thing as http://www.maqamworld.com or http://www.maqamworld.com:80. Those are simply convenient (and readable) notations for the human beings that use the computers. Run wget with the -d option and you will see how the computers break all that down:

DEBUG output created by Wget 1.9-beta1 on linux-gnu.
--08:23:18--  http://www.maqamworld.com/
           => `index.html'
Resolving www.maqamworld.com... 66.48.76.90
Caching www.maqamworld.com => 66.48.76.90
Connecting to www.maqamworld.com[66.48.76.90]:80... connected.
Created socket 3.
Releasing 0x81164d8 (new refcount 1).
---request begin---
GET / HTTP/1.0
User-Agent: Wget/1.9-beta1
Host: www.maqamworld.com
Accept: */*
Connection: Keep-Alive

wget does a domain name lookup on www.maqamworld.com and finds that it resides at 66.48.76.90. It then opens a socket to that IP address on port 80. (Port 80 is the default port for the HTTP protocol specified in the first part of the URL -- it's what is used by almost all web sites and browsers.) Now that the connection is made between your computer and the server, wget sends a GET request (part of the HTTP protocol) to the server. Included in that request is the name of the site being retrieved (Host: www.maqamworld.com), but the port number is never sent by wget to the server.

By the way, when I ran this, wget created an index.html file that looks reasonable to me. It is 23,335 bytes long and is identical to what I get if I do a View Source in my browser and save the text file. Run the following command and send the output to the list if you continue to have problems:

wget http://www.maqamworld.com -d

Tony
Re: Startup delay on Windows
Hrvoje Niksic wrote: Does anyone have an idea what we should consider the home dir under Windows, and how to find it?

On Windows 2000 and XP, there are two environment variables that together provide the user's home directory. (It may go back further than that, but I don't have any machines running older OS versions to confirm it.) For example, on my Windows XP machine, I have the following variables:

HOMEDRIVE=C:
HOMEPATH=\Documents and Settings\Tony Lewis

so my home directory is C:\Documents and Settings\Tony Lewis

HTH, Tony
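A quick sanity check from a command prompt shows the two variables combining into the full path (the output shown assumes the XP machine described above):

C:\> echo %HOMEDRIVE%%HOMEPATH%
C:\Documents and Settings\Tony Lewis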
Re: passing a login and password
robi sen wrote: Hi, I have a client who basically needs to regularly grab content from part of their website and mirror it and/or save it so they can disseminate it as HTML on a CD. The website, though, is written in ColdFusion and requires application-level authentication, which is just form vars passed to the system, one called login and one called password. Is there a way to do this with wget? If not, I suspect I can add to the application's security something that looks for the login and password in the URL; then I could just make sure to append that to the URL that is given to wget.

Later versions of wget support posting of forms. Try:

wget http://www.yourclient.com/somepage.html --http-post='login=user&password=pw'

Tony
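For the archive: the form-posting option that actually shipped in wget 1.9 is --post-data, with fields joined by an ampersand. The URL and field names below are taken from the message above and may not match the real site:

wget 'http://www.yourclient.com/somepage.html' --post-data='login=user&password=pw'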
Re: IPv6 support of wget v 1.9.1
Kazu Yamamoto wrote: Since I have experience modifying IPv4-only programs, including FTP and HTTP, into combined IPv6/IPv4 ones, I know this problem. Yes, some part of wget *would* remain protocol dependent.

Kazu, it's been said that a picture is worth a thousand words. Perhaps in this case, a patch would make your point better. Happy New Year to all!

Tony
Re: need help
Anurag Jain wrote: We are downloading a big .bin file (268 MB) on our Solaris box using wget with the HTTP URL of a file located on some web server. It starts downloading, and after 42% it gives a "no disk space available" message and stops, although when I check the server a lot more free space is available.

How much free space is available on the partition where wget is writing the file? That's the most likely place that you're running out of free space.

Tony
Re: IPv6 support of wget v 1.9.1
Kazu Yamamoto wrote: Thank you for supporting IPv6 in wget v 1.9.1. Unfortunately, wget v 1.9.1 does not work well, at least, on NetBSD. NetBSD does not allow to use IPv4-mapped IPv6 addresses for security reasons. To know the background of this, please refer to: http://www.ietf.org/internet-drafts/draft-cmetz-v6ops-v4mapped-api-harmful-01.txt I don't pretend to know much about IPv6, but the document you're quoting is an Internet Draft that says, we don't think you should implement part of RFC 3493. However, according to RFC 3493, the recommendations it makes have been incorporated into POSIX: IEEE Std. 1003.1-2001 Standard for Information Technology -- Portable Operating System Interface (POSIX). Open Group Technical Standard: Base Specifications, Issue 6, December 2001. ISO/IEC 9945:2002. http://www.opengroup.org/austin It seems to me that whether RFC 3493 is correct or not is something that should be fought in the standards bodies, not in applications like wget. Tony
Re: IPv6 support of wget v 1.9.1
YOSHIFUJI Hideaki wrote: NetBSD etc. is NOT RFC compliant here, however, it would be better if one supports wider platforms / configurations. My patch is quick hack'ed, but I believe that it should work for NetBSD and FreeBSD 5. Please consider applying it. It's not my call as to whether your patch gets applied or not, but it seems to me that anything that supports systems that are not RFC compliant should be enabled from a command line switch unless you can support those systems without having a negative impact on compliant systems. Perhaps your patch does that, but I don't know enough about IPv6 to try to make that assessment. Tony
Re: fork_to_background() on Windows
Gisle Vanem wrote: I've searched Google and the only way AFAICS to get redirection in a GUI app to work is to create 3 pipes. Then use a thread (or run_with_timeout with infinite timeout) to read/write the console handles to put/get data into/from the parent's I/O handles. I don't fully understand how yet, but it could get messy. Just for the sake of running Wget in the background, it doesn't seem to be worth it. Unless someone else has a better idea.

I agree. One can always open new command windows for wget or another application to run in.

Tony
Re: wget Suggestion: ability to scan ports BESIDE #80, (like 443) Anyway Thanks for WGET!
- Original Message -
From: [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Sunday, December 07, 2003 8:04 AM
Subject: wget Suggestion: ability to scan ports BESIDE #80, (like 443) Anyway Thanks for WGET!

What's wrong with wget https://www.somesite.com ?
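For completeness: wget also honors an explicit port in any URL, so non-default ports need no special option at all (the port numbers below are just examples):

wget http://www.somesite.com:8080/
wget https://www.somesite.com:8443/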
Re: question
Danny Linkov wrote: I'd like to download recursively the content of a web directory WITHOUT AN INDEX file.

What shows up in your web browser if you enter the directory (such as http://www.somesite.com/dir/)? The most common responses are:

* some HTML file selected by the server (often index.html, but not always)
* an HTML listing of the directory contents generated by the server
* a 403 Forbidden response

In the first two cases, you can use:

wget http://www.somesite.com/dir/ --mirror

In the third case, you cannot grab the entire directory using wget. You will have to construct a list of filenames in a file and then use:

wget --input-file=FILE

Note that the files retrieved in this fashion will appear in the current directory (but see the sketch below). Hope that helps. By the way, there is a newer version of wget available (although my answer is the same for 1.8.2 and 1.9).

Tony
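A minimal sketch of the third case, with hypothetical filenames; wget's -P (--directory-prefix) option keeps the results out of the current directory:

cat > files.txt <<EOF
http://www.somesite.com/dir/a.pdf
http://www.somesite.com/dir/b.pdf
EOF
wget --input-file=files.txt -P dir/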
Re: a problem on wgetting a PNG image
[EMAIL PROTECTED] wrote: I am not sure if this is a bug, but it's really out of my expectation. Here is the way to reproduce the problem. 1. Put the URL http://ichart.yahoo.com/b?s=CSCO into the browser and then drag out the image. It should be a file with a .png extension, so I believe this is PNG. (Is it?!?) 2. wget -O csco.img "http://ichart.yahoo.com/b?s=CSCO". Originally, I thought that the files obtained either way should be the same. But the fact is that they are of different sizes. From the 2nd file's header, it's a GIF instead. This is the bigger file. Does it mean that wget has transformed the original file? Or did I miss any step?

The server changes its behavior based on the user agent. The following will get back an image/gif:

wget "http://ichart.yahoo.com/b?s=CSCO"

while the following will get back an image/png:

wget "http://ichart.yahoo.com/b?s=CSCO" -U "Mozilla/4.0 (compatible)"

Tony
Re: can you authenticate to a http proxy with a username that contains a space?
antonio taylor wrote: http://fisrtname lastname:[EMAIL PROTECTED] Have you tried http://fisrtname%20lastname:[EMAIL PROTECTED] ?
Re: feature request: --second-guess-the-dns
Hrvoje Niksic wrote: Have you seen the rest of the discussion? Would it do for you if Wget correctly handled something like: wget --header='Host: jidanni.org' http://216.46.192.85/ I think that is an elegant solution. Tony
Re: Does HTTP allow this?
Hrvoje Niksic wrote: Assume that Wget has retrieved a document from the host A, which hasn't closed the connection in accordance with Wget's keep-alive request. Then Wget needs to connect to host B, which is really the same as A because the provider uses DNS-based virtual hosts. Is it OK to reuse the connection to A to talk to B? <snip> FWIW, it works fine with Apache.

There is a fairly high probability that it will work with most hosts (regardless of the server software). If an IP address has been registered with multiple hosts, then the address alone is not sufficient to retrieve a resource, so you have to add a Host header. It's possible that the server responding to the IP address forwards connections to multiple backend servers. These backend servers may or may not know about all the resources that the gateway server knows about. Since it will work most of the time, I think it's a reasonable optimization to use; however, you might want to add a --one-host-per-connection flag for the rare cases where the current behavior won't work.

Tony
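To make the optimization concrete: the reused connection (originally opened to host A's IP address) would carry a request like the following, and everything rides on the server honoring the Host header rather than the connection's origin (host name below is hypothetical):

GET /page.html HTTP/1.0
Host: www.b-example.com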
Re: Does HTTP allow this?
Hrvoje Niksic wrote: The thing is, I don't want to bloat Wget with obscure options to turn off even more obscure (and *very* rarely needed) optimizations. Wget has enough command-line options as it is. If there are cases where the optimization doesn't work, I'd rather omit it completely. It's probably safest to turn off that optimization even if it does eliminate a few opens now and then. Tony
Re: The patch list
Hrvoje Niksic wrote: I'm curious... is anyone using the patch list to track development? I'm posting all my changes to that list, and sometimes it feels a lot like talking to myself. :-) I read the introductory stuff to see what's changed, but I never extract the patches from the messages. From my perspective, the introductory stuff plus a list of affected files would be sufficient. Tony
Re: Wget 1.8.2 bug
Hrvoje Niksic wrote: Incidentally, Wget is not the only browser that has a problem with that. For me, Mozilla is simply showing the source of http://www.minskshop.by/cgi-bin/shop.cgi?id=1&cookie=set, because the returned content-type is text/plain.

On the other hand, Internet Explorer will treat lots of content types as HTML if the content starts with <html>. To see for yourself, try these links:

http://www.exelana.com/test.cgi
http://www.exelana.com/test.cgi?text/plain
http://www.exelana.com/test.cgi?image/jpeg

Perhaps we can add an option to wget so that it will look for an <html> tag in plain text files?

Tony
Re: wget downloading a single page when it should recurse
Philip Mateescu wrote: A warning message would be nice when, for not so obvious reasons, wget doesn't behave as one would expect. I don't know if there are other tags that could change wget's behavior (like -r and <meta name="robots"> do), but if there are, it would be useful to have a message.

I agree that this is worth a notable mention in the wget output. At the very least, running with -d should provide more guidance on why the links it has appended to urlpos are not being followed. Buried in the middle of hundreds of lines of output is:

no-follow in index.php

On the other hand, if other rules prevent a URL from being followed, you might see something like:

Deciding whether to enqueue "http://www.othersite.com/index.html".
This is not the same hostname as the parent's (www.othersite.com and www.thissite.com).
Decided NOT to load it.

Tony
Re: Wget 1.9 about to be released
Hrvoje Niksic wrote: I'm about to release 1.9 today, unless it takes more time to upload it to ftp.gnu.org. If there's a serious problem you'd like fixed in 1.9, speak up now or be silent until 1.9.1. :-) I thought we were going to turn our attention to 1.10. :-)
POST followed by GET
I'm trying to figure out how to do a POST followed by a GET. If I do something like:

wget http://www.somesite.com/post.cgi --post-data 'a=1&b=2' http://www.somesite.com/getme.html -d

I get the following behavior:

POST /post.cgi HTTP/1.0
<snip>
[POST data: a=1&b=2]
<snip>
POST /getme.html HTTP/1.0
<snip>
[POST data: a=1&b=2]

Is this what is expected? Is there a way I can coax wget to POST to post.cgi and GET getme.html?

Tony
Re: POST followed by GET
Hrvoje Niksic wrote: Maybe the right thing would be for `--post-data' to only apply to the URL it precedes, as in: wget --post-data=foo URL1 --post-data=bar URL2 URL3 <snip> But I'm not at all sure that it's even possible to do this and keep using getopt!

I'll start by saying that I don't know enough about getopt to comment on whether Hrvoje's suggestion will work.

It's hard to imagine a situation where wget's current behavior makes sense over multiple URLs. I'm sure someone can come up with an example, but it's likely to be an unusual case. I see the ability to POST a form as being most useful when a site requires some kind of form-based authentication to proceed with looking at other pages within the site. Some alternatives that occur to me follow.

Alternative #1. Only apply --post-data to the first URL on the command line. (A simple solution that probably covers the majority of cases.)

Alternative #2. Allow POST and GET as keywords in the URL list so that:

wget POST http://www.somesite.com/post.cgi --post-data 'a=1&b=2' GET http://www.somesite.com/getme.html

would explicitly specify which URL uses POST and which uses GET. If more than one POST is specified, all use the same --post-data.

Alternative #3. Look for form tags and have --post-file specify the data to be supplied to various forms:

--form-action=URL1 'a=1&b=2' --form-action=URL2 'foo=bar'

Alternative #4. Allow complex sessions to be defined using a session file such as:

wget --session=somefile --user-agent='my robot'

Options specified on the command line apply to every URL. If somefile contained:

--post-data 'data=foo' POST URL1
--post-data 'data=bar' POST URL2
--referer=URL3 GET URL4

it would be logically equivalent to the following three commands:

wget --user-agent='my robot' --post-data 'data=foo' POST URL1
wget --user-agent='my robot' --post-data 'data=bar' POST URL2
wget --user-agent='my robot' --referer=URL3 GET URL4

with wget's state maintained across the session.

Tony
Re: POST followed by GET
Hrvoje Niksic wrote: I like these suggestions. How about the following: for 1.9, document that `--post-data' expects one URL and that its behavior for multiple specified URLs might change in a future version. Then, for 1.10 we can implement one of the alternative behaviors. That works for me... I can hardly wait for 1.9 to get wrapped up so we can start working on 1.10. Hrvoje, has anyone mentioned how glad we are that you've come back? Tony
Re: How do you pronounce Hrvoje?
Hrvoje and I have had an off-list dialogue about this subject. We've settled on HUR-voy-eh as the closest phonetic rendition of his name for English speakers. It helps to remember that the r is rolled. Tony
How do you pronounce Hrvoje?
I've been on this list for a couple of years now and I've always wondered how our illustrious leader pronounces his name. Can you give us linguistically challenged Americans a phonetic rendition of your name? Tony Lewis (toe knee loo iss)
Re: Using chunked transfer for HTTP requests?
Hrvoje Niksic wrote: Please be aware that Wget needs to know the size of the POST data in advance. Therefore the argument to @code{--post-file} must be a regular file; specifying a FIFO or something like @file{/dev/stdin} won't work.

There's nothing that says you have to read the data after you've started sending the POST. Why not just read the --post-file before constructing the request so that you know how big it is?

My first impulse was to bemoan Wget's antiquated HTTP code, which doesn't understand chunked transfer. But, come to think of it, even if Wget used HTTP/1.1, I don't see how a client could send chunked requests and interoperate with HTTP/1.0 servers. How do browsers figure out whether they can do a chunked transfer or not?

Tony
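For reference, a chunked body (the HTTP/1.1 feature under discussion) carries its own framing, which is why the total size need not be known up front: each chunk is prefixed with its length in hexadecimal, and a zero-length chunk terminates the body. On the wire it looks like this (each line followed by CRLF):

4
Wiki
5
pedia
0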
Re: Using chunked transfer for HTTP requests?
Hrvoje Niksic wrote: I don't understand what you're proposing. Reading the whole file in memory is too memory-intensive for large files (one could presumably POST really huge files, CD images or whatever).

I was proposing that you read the file to determine the length, but that was on the assumption that you could read the input twice, which won't work with the example you proposed: "It would be really nice to be able to say something like: mkisofs blabla | wget http://burner/localburn.cgi --post-file /dev/stdin"

Stefan Eissing wrote: I just checked with RFC 1945 and it explicitly says that POSTs must carry a valid Content-Length header.

In that case, Hrvoje will need to get creative. :-) Can you determine if --post-file is a regular file? If so, I still think you should just read (or otherwise examine) the file to determine the length. For other types of input, perhaps you want to write the input to a temporary file.

Tony
Re: Using chunked transfer for HTTP requests?
Hrvoje Niksic wrote: That would work for short streaming, but would be pretty bad in the mkisofs example. One would expect Wget to be able to stream the data to the server, and that's just not possible if the size needs to be known in advance, which HTTP/1.0 requires. One might expect it, but if it's not possible using the HTTP protocol, what can you do? :-)
Re: Web page source using wget?
Suhas Tembe wrote: 1). I go to our customer's website every day and log in using a User Name & Password. [snip] 4). I save the source to a file and subsequently perform various tasks on that file. What I would like to do is automate this process of obtaining the source of a page using wget. Is this possible?

That depends on how you enter your user name and password. If it's via an HTTP user ID and password, that's pretty easy:

wget http://www.custsite.com/some/page.html --http-user=USER --http-passwd=PASS

If you supply your user ID and password via a web form, it will be tricky (if not impossible) because wget doesn't POST forms (unless someone added that option while I wasn't looking. :-)

Tony
Re: Option to save unfollowed links
Hrvoje Niksic wrote: I'm curious: what is the use case for this? Why would you want to save the unfollowed links to an external file?

I use this to determine what other websites a given website refers to. For example:

wget http://directory.google.com/Top/Regional/North_America/United_States/California/Localities/H/Hayward/ --mirror -np --unfollowed-links=hayward.out

By looking at hayward.out, I have a list of all websites that the directory refers to. When I use this file, I sort it and throw away the Google and DMOZ links. Everything else is supposed to be something interesting about Hayward.

Tony
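The post-processing step described above can be a one-liner; the site names being discarded are the ones mentioned in the message, and the file names come from the example command (matching "google" and "dmoz" anywhere in a URL is an assumption about what those links look like):

sort -u hayward.out | grep -v -i -e google -e dmoz > interesting.txt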
Re: Reminder: wget has no maintainer
Daniel Stenberg wrote: The GNU project is looking for a new maintainer for wget, as the current one wishes to step down.

I think that means we need someone who:

1) is proficient in C
2) knows Internet protocols
3) is willing to learn the intricacies of wget
4) has the time to go through months' worth of email and patches
5) expects to have time to continue to maintain wget

Anyone here think they fit that bill? (Feel free to add to my suggestions about what kind of person we need.)

Tony
Re: wget problem
Rajesh wrote: Wget is not mirroring the web site properly. For example, it is not copying symbolic links from the main web server. The target directories do exist on the mirror server.

wget can only mirror what can be seen from the web. Symbolic links will be treated as hard references (assuming that some web page points to them). If you cannot get there from http://www.sl.nsw.gov.au/ via your browser, wget won't get the page. Also, some servers change their behavior depending on the client. You may need to use a user agent that looks like a browser to mirror some sites. For example:

wget --user-agent="Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"

will make it look like wget is really Internet Explorer running on Windows XP.

Another problem is some of the files are different on the mirror web server. For example, compare these 2 attached files: penrith1.cfm is the file after wget copied it from the main server; penrith1.cfm.org is the actual file sitting on the main server.

wget is storing what the web server returned, which may or may not be the precise file stored on your system. In particular, I notice that penrith1.cfm contains <!--Requested: 17:30:40 Thursday 3 July 2003 -->. That implies that all or part of the output is generated programmatically. You might try using wget to replicate an FTP version of the website. Then again, perhaps wget is the wrong tool for your task. Have you considered using secure copy (scp) instead?

HTH, Tony
Re: wget problem
Rajesh wrote: Thanks for your reply. I have tried using the command wget --user-agent="Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)", but it didn't work.

Adding the user agent helps some people -- I think most often with web servers from the evil empire.

I have one more question. In each directory I have a welcome.cfm file on the main server (DirectoryIndex order is welcome.cfm welcome.htm welcome.html index.html). But, when I run wget on the mirror server, wget renames welcome.cfm to index.html and downloads it to the mirror server. Why does it change the file name from welcome.cfm to index.html?

It appears to me that wget assumes that the result of getting a directory (such as http://www.sl.nsw.gov.au/collections/) is index.html. (See the debug output below.)

How can I mirror a web site using scp?? I can only copy one file at a time using scp.

The following works for me:

scp -r [EMAIL PROTECTED]:path/to/directory/* .

** The promised debug output:

wget http://www.sl.nsw.gov.au/collections --debug

DEBUG output created by Wget 1.8.1 on linux-gnu.
--20:16:36--  http://www.sl.nsw.gov.au/collections
           => `collections'
Resolving www.sl.nsw.gov.au... done.
Caching www.sl.nsw.gov.au => 192.231.59.40
Connecting to www.sl.nsw.gov.au[192.231.59.40]:80... connected.
Created socket 3.
Releasing 0x810dc38 (new refcount 1).
---request begin---
GET /collections HTTP/1.0
User-Agent: Wget/1.8.1
Host: www.sl.nsw.gov.au
Accept: */*
Connection: Keep-Alive
---request end---
HTTP request sent, awaiting response...
HTTP/1.1 301 Moved Permanently
Date: Fri, 04 Jul 2003 03:16:36 GMT
Server: Apache/1.3.19 (Unix)
Location: http://www.sl.nsw.gov.au/collections/
Connection: close
Content-Type: text/html; charset=iso-8859-1
Location: http://www.sl.nsw.gov.au/collections/ [following]
Closing fd 3
--20:16:37--  http://www.sl.nsw.gov.au/collections/
           => `index.html'
Found www.sl.nsw.gov.au in host_name_addresses_map (0x810dc38)
Connecting to www.sl.nsw.gov.au[192.231.59.40]:80... connected.
Created socket 3.
Releasing 0x810dc38 (new refcount 1).
---request begin---
GET /collections/ HTTP/1.0
User-Agent: Wget/1.8.1
Host: www.sl.nsw.gov.au
Accept: */*
Connection: Keep-Alive
---request end---
HTTP request sent, awaiting response...
HTTP/1.1 200 OK
Date: Fri, 04 Jul 2003 03:16:37 GMT
Server: Apache/1.3.19 (Unix)
Connection: close
Content-Type: text/html; charset=iso-8859-1
Length: unspecified [text/html]

    [          ]  21,284    20.83K/s

Closing fd 3
20:16:38 (20.83 KB/s) - `index.html' saved [21284]
wget is smarter than Internet Explorer!
I tried to retrieve a URL with Internet Explorer and it continued to retrieve the URL forever. I tried to grab that same URL with wget, which tried twice and then reported "redirection cycle detected". Perhaps we should send the wget code to someone in Redmond.

Tony
Re: Comment handling
Aaron S. Hawley wrote: Why not just have the default wget behavior follow comments explicitly (I've lost track of whether wget does that or needs to be amended) /and/ have an option that goes /beyond/ quirky comments and is just --ignore-comments? :)

The issue we've been discussing is what to do about things that almost follow the rules for HTML comments, but don't quite get it right. By default, wget ignores legitimate HTML comments.

Tony
Re: Comment handling
Aaron S. Hawley wrote: i'm just saying what's going to happen when someone posts to this list: My Web Pages have [insert obscure comment format] for comments and Wget is considering them to (not) be comments. Can you change the [insert Wget comment mode] comment mode to (not) recognize my comments? One way to implement quirky comments is to allow the user to add their own comment format to the wgetrc file. Tony
Re: Comment handling
Georg Bauhaus wrote: I don't think so. Actually the rules for SGML comments are somewhat different.

Georg, I think we're talking about apples and oranges here. I'm talking about what is legitimate in a comment in an SGML document. I think you're talking about what is legitimate as a comment in an SGML declaration.

At any rate, I decided to do some more poking around. I wrote a web page (see http://www.exelana.com/comments.html) with the following variations on comments:

<!-- Comment -->
<!-- -- -->
<!>
<!

The browsers I tried (Internet Explorer, Mozilla, and Lynx) ignore all of them. I also tried the W3C Markup Validation Service at http://validator.w3.org/ It reported that the last one is not valid: Line 22, column 8: comment started here <!

http://validator.w3.org/check?uri=http%3A%2F%2Fwww.exelana.com%2Fcomments.htmldoctype=HTML+2.0charset=us-ascii+%28basic+English%29

The moral of the story: one cannot evaluate an HTML document solely on what any browser (or even all of them) does with it.

Tony
Re: Comment handling
George Prekas wrote: You are probably right. I have pointed this out because I have seen pages that use <!-- followed by lots of dashes as a separator, and although Internet Explorer shows the page, wget cannot download it correctly. What do you think about finishing the comment at the >?

After reading http://www.w3c.org/MarkUp/SGML/sgml-lex/sgml-lex I am convinced that <!-> is a valid SGML (and therefore HTML) comment. Therefore, I believe it is a bug if wget does not recognize such a comment. Note: I haven't studied the source to confirm how it handles such a string.

Tony
Re: Comment handling
George Prekas wrote: I have found a bug in Wget version 1.8.2 concerning comment handling ( <!-- comment --> ). Take a look at the following illegal HTML code:

<HTML>
<BODY>
<a href="test1.html">test1.html</a>
<!-->
<a href="test2.html">test2.html</a>
<!-->
</BODY>
</HTML>

Now, save the above snippet as test.html and try wget -Fi test.html. You will notice that it doesn't recognise the second link. I have found a solution to the above situation and have properly patched html-parse.c, and I would like some info on how I can give you the patch.

The HTML code is legitimate, but it only contains one link. The following three lines constitute a single comment:

<!-->
<a href="test2.html">test2.html</a>
<!-->

A comment begins at <!-- and ends at -->. The trailing > on the first of these lines and the leading <! on the third of these lines are part of the comment. That is, the comment text is:

> <a href="test2.html">test2.html</a> <!

At any rate, one should not expect predictable behavior for broken HTML. What should wget do with the following?

<a href="test1.html">test1.html <!-- </a> <!-->

In one version, it might choose to follow the link to test1.html and in another version it might not.

Tony
Re: Cannot get wildcards to work ??
Dick Penny wrote: I have just successfully used WGET on a single file download. I even figured out how to specify a destination. But, I cannot seem to get wildcards to work. Help please:

wget -o log.txt -P "c:/Documents and Settings/Administrator/My Documents/CME_data/bt" ftp://ftp.cme.com/pub/bulletin/historical_data/bt02100?.zip

You requested the resource bt02100 with a query string of .zip. You might just as easily have asked for bt02100.cgi?extension=zip. When it appears in a URL, the question mark is not a wildcard character; it is a separator between the resource and the query string. Chances are very good that the server doesn't have a clue how to process such a request.

Tony
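If the goal was wildcard matching rather than a query string, wget's FTP globbing may help with a * pattern; quoting the URL keeps the shell from expanding the wildcard first (untested against this particular server):

wget 'ftp://ftp.cme.com/pub/bulletin/historical_data/bt02100*.zip'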
Re: Static Mirror of DB-Driven Site
Dan Mahoney, System Admin wrote: Assume I have a site that I want to create a static mirror of. Normally this site is database driven, but I figure if I spider the entire site and map all the GET URLs to static URLs, I can have a full mirror. Has anyone known of this being successfully done? How would I get Apache to see the page names as full names (for example, a page named exec.pl?name=blah&foo=bar actually being a file rather than a command)?

wget should already do what you want (provided that the file system where you will be mirroring the results can handle things like ?, =, and & in a file name). wget does not care how Apache processes a URL; it only cares that when it does a GET of a URL, some object is returned.

The issue for you will be making sure that all the things you want to mirror are referenced as links on the site. How does a person visiting your site know that blah is a valid value for name or that bar is a valid value for foo? If they learn this by clicking on a link, then everything should work as you want. However, if the user must supply the values for name and foo (perhaps by entering them in a form), then there is no way for wget to know those values. If that is the case, you will have to construct your own list of URLs with all the combinations of name and foo that you want to mirror (a sketch follows).

HTH. Tony
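A minimal sketch of that last resort, with made-up parameter values and a placeholder host, generating every name/foo combination into a list that wget can consume:

for name in blah other; do
  for foo in bar baz; do
    echo "http://www.somesite.com/exec.pl?name=$name&foo=$foo"
  done
done > urls.txt
wget --input-file=urls.txt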
Re: conditional url encoding
Ryan Underwood wrote: It seems that some servers are broken, and in order to fetch files with certain filenames, some characters that are normally encoded in HTTP requests must be sent through unencoded. For example, I had a server the other day that I was fetching files from at the URL: http://server.com/~foobar/files

I'm having a hard time figuring out why wget is encoding the tilde in the first place. The way I read RFC 2396, tilde is one of several "marks" that are not encoded. The complete set of marks defined in RFC 2396 is -_.!~*'(). Perhaps the encoding rules in wget were written prior to the publication of RFC 2396 and are based on the "national" character discussion of RFC 1630. If so, tilde is the only character that was defined as "national" in RFC 1630 and as a "mark" in RFC 2396. For what it's worth, the national characters in RFC 1630 are {}|[]\^~.

Tony
Re: image tags not read
Johannes Berg wrote: Maybe this isn't really a bug in wget but rather in the file, but since this is standard as exported from MS Word I'd like to see wget recognize the images and download them. Microsoft Word claims to create a valid HTML file. In fact, what it creates can only reliably be read by Internet Explorer. (It may even only be read by recent versions of Internet Explorer.) The file that it produces contains a number of proprietary tags as well as proprietary variations of standard HTML that only Microsoft understands. wget has a simple HTML parser that cannot understand these variations. While there may be someone who is interested in patching the wget parser to deal with Word's pseudo-HTML, I doubt that such changes would ever become part of a standard wget release. You might have better luck finding someone who is willing to write a program to convert Word's pseudo-HTML into real HTML that can be read by most HTML parsers. Since you're in an academic setting, your odds of finding someone willing to do this kind of program might be higher. Good luck. Tony
Re: ralated links in javascripts script
cyprien wrote: I want to mirror my home site. Everything works fine except one thing: my site is a photo site based on PHP scripts: Gallery (http://gallery.sourceforge.net). It also has some JavaScript scripts... [snip] What can I do to have that (on the mirror site)?

You cannot, because wget does not parse JavaScript. It only finds links in the HTML.

Tony
Re: newbie doubts
Nandita Shenvi wrote: I have not copied the whole script but just the last few lines. The variable $all_links[3] has a URL: http://bolinux39.europe.nokia.com/database2/MIDI100/GS001/01FINALC.MID. The link refers to a file, which I require. I remove the http:// before calling wget, but I still get an error message:

--13:56:24-- http://%20bolinux39.europe.nokia.com/database2/MIDI100/GS001/01FINALC.MID%0A => `wgetcheck'

$all_links[3] needs to be cleaned up. It contains a trailing \n, and there is a space between "http://" and "bolinux39" that should not be there. The \n is easily addressed by:

chomp @all_links;

You should look at your script to determine how the space got there. You can get rid of all spaces by:

$all_links[3] =~ s/ //g;

but that may not be what you want. You're better off figuring out how the unwanted space got there in the first place and making sure it doesn't happen.

Tony
Re: Virus messages .....
Frank Helk wrote: Free (web based) scanning is available at http://www.antivirus.com. Select Free tools in the top menu and then Scan Your PC, Free from the list. You'll not even have to register to use it. Please. It may not be so simple. Klez uses anti-anti-virus techniques to prevent itself from being detected or deleted. It took me several days to figure out how to eradicate it from my computer. The only way I was able to get rid of it was by booting my computer in Safe mode and installing the anti-virus software from a CD. If one runs something that has been downloaded from the Internet, there is a strong possibility that Klez will infect it before it can do its detection or cleaning. I agree with Frank. If you're running Windows (particularly if you're using Outlook and/or have opened attachments sent to wget lists recently), you need to scan (and probably clean) your computer. Tony
Re: (Extended) Reading commandline option values from files or file descriptors (for wget v1.8.1)
Herold Heiko wrote: It would be better IMHO if the options themselves were modified; in that case the variable option wouldn't be necessary. Supposing we keep the @ and :, this could be --@http-passwd=passwd.txt --:proxy-passwd=0

It seems to me that a convention like this should be adopted (or rejected) across a wide range of applications. Is there a GNU-wide mailing list where this could be proposed and discussed?

Tony
Re: Virus mails
Brix Lichtenberg wrote: But I'm still getting three or more virus mails with attachments 100k+ daily from the wget lists and they're blocking my mailbox (dial-up). And getting those dumb system warnings accompanying them doesn't make it better. Isn't there really no way to stop that (at least disallow attachments)? Patches and such still can be pasted into the text, can't they? I agree. Why not treat any mail with an attachment as suspect. Let the moderators approve any valid messages. Tony
Re: wget does not honour content-length http header [http://bugs.debian.org/143736]
Hrvoje Niksic wrote: If your point is that Wget should print a warning when it can *prove* that the Content-Length data it received was faulty, as in the case of having received more data, I agree. We're already printing a similar warning when Last-Modified is invalid, for example.

I'm afraid you'll have to ask R. Fielding, J. Gettys, J. Mogul, H. Frystyk, and T. Berners-Lee what they were thinking. <grin> I was just quoting from RFC 2068: Hypertext Transfer Protocol -- HTTP/1.1. As for printing a warning only when wget can prove that the Content-Length data was faulty, that sounds like a reasonable implementation to me.

Tony
Re: apache irritations
Maciej W. Rozycki wrote: Hmm, it's too fragile in my opinion. What if a new version of Apache defines a new format?

I think all of the expressions proposed thus far are too fragile. Consider the following URL:

http://www.google.com/search?num=100&q=%2Bwget+-GNU

The regular expression needs to account for multiple arguments separated by ampersands. It also needs to account for any valid URI character between an equal sign and either end of string or an ampersand. I'm not fluent enough in regular expressions to compose one myself. (Some day I'll absorb all of Friedl's "Mastering Regular Expressions", but not today.)

Tony
Re: apache irritations
Maciej W. Rozycki wrote: I'm not sure what you are referring to. We are discussing a common problem with static pages generated by default by Apache as index.html objects for server's filesystem directories providing no default page. Really? The original posting from Jamie Zawinski said: I know this would be somewhat evil, but can we have a special case in wget to assume that files named ?N=D and index.html?N=D are the same as index.html? I'm tired of those dumb apache sorting directives showing up in my mirrors as if they were real files... I understood the question to be about URLs containing query strings (which Jamie called sorting directives) showing up as separate files. I thought the discussion was related to that topic. Maybe it diverged from that later in the chain and I missed the change of topic. I think what Jamie wants is one copy of index.html no matter how many links of the form index.html?N=D appear. BTW, wget's accept/reject rules are not regular expressions but simple shell globbing patterns. OK. Tony
Re: HTTP 1.1
Hrvoje Niksic wrote: Is there any way to make Wget use HTTP/1.1 ? Unfortunately, no. In looking at the debug output, it appears to me that wget is really sending HTTP/1.1 headers, but claiming that they are HTTP/1.0 headers. For example, the Host header was not defined in RFC 1945, but wget is sending it. Tony
Re: Current download speed in progress bar
Hrvoje Niksic wrote: The one remaining problem is the ETA. Based on the current speed, it changes value wildly. Of course, over time it is generally decreasing, but one can hardly follow it. I removed the flushing by making sure that it's not shown more than once per second, but this didn't fix the problem of unreliable values.

I'm often annoyed by ETA estimates that make no sense. How about showing two values -- something like:

ETA at average speed: 1:05:17
ETA at current speed: 15:05

Then the user can decide which value is more meaningful. In addition, it gives feedback about the current speed versus the average.

Tony
Re: Current download speed in progress bar
Hrvoje Niksic wrote: I'll grab the other part and explain what curl does. It shows a current speed based on the past five seconds, Does it mean that the speed doesn't change for five seconds, or that you always show the *current* speed, but relative to the last five seconds? I may be missing something, but I don't see how to efficiently implement the latter. Could you keep an array of speeds that is updated once a second such that the value from six seconds ago is discarded and when the value for the second that just ended is recorded? Tony
Re: Referrer Faking and other nifty features
Andre Majorel wrote: Yes, that allows me to specify _A_ referrer, like www.aol.com. When I'm trying to help my users mirror their old Angelfire pages or something like that, very often the link has to come from the same directory. I'd like to see something where, when wget follows a link to another page or another image, it automatically supplies the URL of the page it followed to get there. Is there a way to do this? Somebody already asked for this and AFAICT, there's no way to do that.

Not only is it possible, it is the existing behavior (at least in wget 1.8.1). If you run with -d, you will see that every GET after the first one includes the appropriate referer. If I execute:

wget -d -r http://www.exelana.com --referer=http://www.aol.com

the first request is reported as:

GET / HTTP/1.0
User-Agent: Wget/1.8.1
Host: www.exelana.com
Accept: */*
Connection: Keep-Alive
Referer: http://www.aol.com

But the third request is:

GET /left.html HTTP/1.0
User-Agent: Wget/1.8.1
Host: www.exelana.com
Accept: */*
Connection: Keep-Alive
Referer: http://www.exelana.com/

The second request is for robots.txt and uses the referer from the command line.

Tony
Re: wget parsing JavaScript
Ian Abbott wrote: For example, a recursive retrieval on a page like this:

<html>
<body>
<script>
<a href="foo.html">foo</a>
</script>
</body>
</html>

will retrieve foo.html, regardless of the <script>...</script> tags.

We seem to be talking about two completely different things, Ian. A page that looks like this:

<html>
<head>
<script>
top.location = "foo.html";
</script>
</head>
<body>
This page transfers to foo.
</body>
</html>

won't retrieve foo.html. That's what I have been trying to get across.

Tony
Re: wget parsing JavaScript
Csaba Ráduly wrote: I see that wget handles SCRIPT with tag_find_urls, i.e. it tries to parse whatever is inside. Why was this implemented? JavaScript is most often used to construct links programmatically. wget is likely to find bogus URLs until it can properly parse JavaScript.

wget is parsing the attributes within the script tag, i.e., <script src="url">. It does not examine the content between <script> and </script>. It looks for src="url" because the source file is just another file that may need to be copied (along with all the other files that are needed to mirror a site).

Tony
Re: wget parsing JavaScript
I wrote: wget is parsing the attributes within the script tag, i.e., <script src="url">. It does not examine the content between <script> and </script>.

and Ian Abbott responded: I think it does, actually, but that is mostly harmless.

You're right. What I meant was that it does not examine the JavaScript looking for URLs.

Tony
Re: (Fwd) Automatic posting to forms
Daniel Stenberg responded to my original suggestion: "With this information, any time that wget encounters a form whose action is /cgi-bin/auth.cgi, it will enqueue the submission of the form using the values provided for the fields id and pw." Now, why would wget do this?

There are many examples of sites that require the user to post a form to access other parts of the site -- sometimes the post contains user-supplied data and sometimes it doesn't. If one wants to grab everything on the other side of that form, having wget post the form seems like the way to get there.

Yes, probably: when the form tag contains enctype='multipart/form-data' you need to build an entirely different data stream (RFC 1867 is the key here).

You're right. I had not yet thought about that flavor of posting.

I'd also like to point out that curl already supports both regular HTTP POST as well as multipart formposts.

Unless I'm misreading the curl manual, that only allows me to get one page. However, I've never been inclined to invent wheels when I can download them. I will study the curl source code related to posting before I wander too far down this path.

Tony