Re: Using chunked transfer for HTTP requests?
Theoretically, an HTTP/1.0 server should accept a request with an unknown content length if the connection is closed after the request. Unfortunately, the 411 Length Required response is only defined in HTTP/1.1. //Stefan

On Tuesday, 07.10.03, at 01:12 (Europe/Berlin), Hrvoje Niksic wrote: As I was writing the manual for `--post', I decided that I wasn't happy with this part: Please be aware that Wget needs to know the size of the POST data in advance. Therefore the argument to @code{--post-file} must be a regular file; specifying a FIFO or something like @file{/dev/stdin} won't work. My first impulse was to bemoan Wget's antiquated HTTP code, which doesn't understand chunked transfer. But, come to think of it, even if Wget used HTTP/1.1, I don't see how a client can send chunked requests and interoperate with HTTP/1.0 servers. The thing is, to be certain that you can use chunked transfer, you have to know you're dealing with an HTTP/1.1 server. But you can't know that until you receive a response, and you don't get a response until you've finished sending the request. A chicken-and-egg problem! Of course, once a response is received, we could remember that we're dealing with an HTTP/1.1 server, but that information is all but useless, since Wget's `--post' is typically used to POST information to one URL and exit. Is there a sane way to stream data to HTTP/1.0 servers that expect POST?
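For readers who haven't dealt with it, the chunked coding the thread keeps referring to is simple framing: each chunk is its size in hex, CRLF, the data, CRLF, and a zero-sized chunk ends the body. A rough Python sketch of streaming a POST that way over a raw socket (host, port, path and the data source are placeholders, and a real client would need proper response parsing and error handling):

    # Sketch only: stream a POST body with Transfer-Encoding: chunked over a
    # raw socket.  Host, port, path and the data source are hypothetical.
    import socket
    import sys

    def post_chunked(host, port, path, stream, chunk_size=8192):
        sock = socket.create_connection((host, port))
        sock.sendall(("POST %s HTTP/1.1\r\n"
                      "Host: %s\r\n"
                      "Transfer-Encoding: chunked\r\n"
                      "Connection: close\r\n"
                      "\r\n" % (path, host)).encode("ascii"))
        while True:
            data = stream.read(chunk_size)
            if not data:
                break
            # Each chunk: size in hex, CRLF, the data, CRLF.
            sock.sendall(b"%x\r\n" % len(data) + data + b"\r\n")
        sock.sendall(b"0\r\n\r\n")        # a zero-sized chunk ends the body
        reply = sock.recv(4096)           # first part of the response
        sock.close()
        return reply

    if __name__ == "__main__":
        # e.g.:  mkisofs ... | python post_chunked.py
        print(post_chunked("burner", 80, "/localburn.cgi", sys.stdin.buffer))

The catch, of course, is exactly what the thread is about: an HTTP/1.0 server has no idea what this framing means, and the client cannot find out which kind of server it is talking to until the request is already on the wire.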
Re: Using chunked transfer for HTTP requests?
On Tue, 7 Oct 2003, Hrvoje Niksic wrote: My first impulse was to bemoan Wget's antiquated HTTP code, which doesn't understand chunked transfer. But, come to think of it, even if Wget used HTTP/1.1, I don't see how a client can send chunked requests and interoperate with HTTP/1.0 servers. The thing is, to be certain that you can use chunked transfer, you have to know you're dealing with an HTTP/1.1 server. But you can't know that until you receive a response, and you don't get a response until you've finished sending the request. A chicken-and-egg problem! The only way I can think of to deal with this automatically is to use an Expect: 100-continue request header and, based on the 100 response, decide whether the server is 1.1 or not. Other than that, I think a command line option is the only choice. -- -=- Daniel Stenberg -=- http://daniel.haxx.se -=- ech`echo xiun|tr nu oc|sed 'sx\([sx]\)\([xoi]\)xo un\2\1 is xg'`ol
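The 100-continue handshake Daniel mentions amounts to sending only the request headers with an Expect: 100-continue field and waiting briefly for the interim "HTTP/1.1 100 Continue" status line before committing to a body format. A rough sketch of that probe (placeholder host and path, no real error handling; in practice you would keep the connection open and start the body as soon as the 100 arrives rather than reconnecting):

    # Sketch of the Expect: 100-continue probe: send only the headers, then
    # peek at the interim reply before deciding how to send the body.
    # Placeholder names; real code needs careful timeout and error handling.
    import socket

    def server_is_http11(host, port, path):
        sock = socket.create_connection((host, port))
        sock.sendall(("POST %s HTTP/1.1\r\n"
                      "Host: %s\r\n"
                      "Expect: 100-continue\r\n"
                      "Transfer-Encoding: chunked\r\n"
                      "\r\n" % (path, host)).encode("ascii"))
        sock.settimeout(3)            # an HTTP/1.0 server may simply stay silent
        try:
            status_line = sock.recv(1024).split(b"\r\n", 1)[0]
        except socket.timeout:
            status_line = b""
        sock.close()
        # b"HTTP/1.1 100 Continue" means a chunked body should be safe.
        return status_line.startswith(b"HTTP/1.1 100")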
Re: wget 1.9 - behaviour change in recursive downloads
Quoting Hrvoje Niksic [EMAIL PROTECTED]: Jochen Roderburg [EMAIL PROTECTED] writes: Quoting Hrvoje Niksic [EMAIL PROTECTED]: It's a feature. `-A zip' means `-A zip', not `-A zip,html'. Wget downloads the HTML files only because it absolutely has to, in order to recurse through them. After it finds the links in them, it deletes them. Hmm, so it has really been an undetected error over all the years ;-) ? s/undetected/unfixed/ At least I've always considered it an error. I didn't know people depended on it. Well, *depend* is a rather strong expression for that ;-) It has always worked that way, I got used to it, and I never really thought about whether it was correct or not, because I had a use for it. So I was astonished when these files suddenly disappeared. As I wrote already, I will mention them explicitly now. I think the worst that will happen is that I get a few more of them than before. Perhaps the whole thing could be mentioned in the documentation of the accept/reject options. Currently there is only this sentence there: Note that these two options do not affect the downloading of HTML files; Wget must load all the HTMLs to know where to go at all--recursive retrieval would make no sense otherwise. J. Roderburg
Re: some wget patches against beta3
Hrvoje Niksic [EMAIL PROTECTED] writes: As for the Polish translation, translations are normally handled through the Translation Project. The TP robot is currently down, but I assume it will be back up soon, and then we'll submit the POT file and update the translations /en masse/. It took a little longer than expected, but now the robot is up and running again. This morning (CET) I installed b3 for translation.
Re: -q and -S are incompatible
Dan Jacobson [EMAIL PROTECTED] writes: -q and -S are incompatible and should perhaps produce errors and be noted thus in the docs. They seem to work as I'd expect -- `-q' tells Wget to print *nothing*, and that's what happens. The output Wget would have generated does contain HTTP headers, among other things, but it never gets printed. BTW, there seems to be no way to get the -S output but no progress indicator; -nv and -q kill them both. It's a bug that `-nv' kills `-S' output, I think. P.S. one shouldn't have to confirm each bug submission. Once should be enough. You're right. :-( I'll ask the sunsite people if there's a way to establish some form of white lists...
Re: some wget patches against beta3
Karl Eichwalder [EMAIL PROTECTED] writes: Hrvoje Niksic [EMAIL PROTECTED] writes: As for the Polish translation, translations are normally handled through the Translation Project. The TP robot is currently down, but I assume it will be back up soon, and then we'll submit the POT file and update the translations /en masse/. It took a little bit longer than expected but now, the robot is up and running again. This morning (CET) I installed b3 for translation. However, http://www2.iro.umontreal.ca/~gnutra/registry.cgi?domain=wget still shows `wget-1.8.2.pot' to be the current template for [the] domain. Also, my Croatian translation of 1.9 doesn't seem to have made it in. Is that expected?
Re: some wget patches against beta3
Karl Eichwalder [EMAIL PROTECTED] writes: Also, my Croatian translation of 1.9 doesn't seem to have made it in. Is that expected? Unfortunately, yes. Will you please resubmit it with the subject line updated (IIRC, it's now): TP-Robot wget-1.9-b3.hr.po I'm not sure what b3 is, but the version in the POT file was supposed to be beta3. Was there a misunderstanding somewhere along the line?
Re: some wget patches against beta3
Karl Eichwalder [EMAIL PROTECTED] writes: Hrvoje Niksic [EMAIL PROTECTED] writes: I'm not sure what b3 is, but the version in the POT file was supposed to be beta3. Was there a misunderstanding somewhere along the line? Yes, the robot does not like beta3 as part of the version string. b3 or pre3 are okay. Ouch. Why does the robot care about version names at all?
Re: some wget patches against beta3
Karl Eichwalder [EMAIL PROTECTED] writes: Hrvoje Niksic [EMAIL PROTECTED] writes: Ouch. Why does the robot care about version names at all? It must know about the sequences; this is important for merging issues. IIRC, we have at least these sequences supported by the robot:

1.2 -> 1.2.1 -> 1.2.2 -> 1.3 etc.
1.2 -> 1.2a -> 1.2b -> 1.3
1.2 -> 1.3-pre1 -> 1.3-pre2 -> 1.3
1.2 -> 1.3-b1 -> 1.3-b2 -> 1.3

Thanks for the clarification, Karl. But as a maintainer of a project that tries to use the robot, I must say that I'm not happy about this. If the robot absolutely must be able to collate versions, then it should be smarter about it and support a larger array of the formats in use out there. See `dpkg' for an example of how it can be done, although the TP robot certainly doesn't need to do all that `dpkg' does. This way, unless I'm missing something, the robot seems to be in a position to dictate its very narrow-minded versioning scheme to the projects that would only like to use it (the robot). That's really bad. But what's even worse is that something or someone silently changed beta3 to b3 in the POT, and then failed to perform the same change for my translation, which caused it to get dropped without notice. Returning an error that says "your version number is unparsable to this piece of software; you must use one of ..." instead would be more correct in the long run. Is the robot written in Python? Would you consider it for inclusion if I donated a function that performed the comparison more fully (provided, of course, that the code meets your standards of quality)?
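A comparison function of the kind being offered can be quite small. The sketch below is only an illustration of the idea (split a version string into numeric and alphabetic runs, and treat a trailing alphabetic run such as "beta3", "b3" or "pre3" as a pre-release); it is neither the TP robot's code nor dpkg's exact algorithm:

    # Illustration only: a lenient version ordering in which pre-release
    # suffixes sort below the release, so 1.9-beta3 < 1.9 < 1.9.1.
    import re

    def version_key(version):
        key = []
        for number, letters in re.findall(r"(\d+)|([A-Za-z]+)", version):
            if number:
                key.append((2, int(number)))      # numeric runs compare as numbers
            else:
                key.append((0, letters.lower()))  # alphabetic runs mark pre-releases
        key.append((1, 0))    # end of string: above a letter suffix, below a number
        return key

    def compare_versions(a, b):
        ka, kb = version_key(a), version_key(b)
        return (ka > kb) - (ka < kb)              # -1, 0 or 1, like strcmp

    # compare_versions("1.9-beta3", "1.9")      -> -1
    # compare_versions("1.9",       "1.9.1")    -> -1
    # compare_versions("1.3-pre1",  "1.3-pre2") -> -1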
Re: Using chunked transfer for HTTP requests?
Hrvoje Niksic wrote: Please be aware that Wget needs to know the size of the POST data in advance. Therefore the argument to @code{--post-file} must be a regular file; specifying a FIFO or something like @file{/dev/stdin} won't work. There's nothing that says you have to read the data after you've started sending the POST. Why not just read the --post-file before constructing the request so that you know how big it is? My first impulse was to bemoan Wget's antiquated HTTP code which doesn't understand chunked transfer. But, coming to think of it, even if Wget used HTTP/1.1, I don't see how a client can send chunked requests and interoperate with HTTP/1.0 servers. How do browsers figure out whether they can do a chunked transfer or not? Tony
Re: Using chunked transfer for HTTP requests?
Tony Lewis [EMAIL PROTECTED] writes: Hrvoje Niksic wrote: Please be aware that Wget needs to know the size of the POST data in advance. Therefore the argument to @code{--post-file} must be a regular file; specifying a FIFO or something like @file{/dev/stdin} won't work. There's nothing that says you have to read the data after you've started sending the POST. Why not just read the --post-file before constructing the request so that you know how big it is? I don't understand what you're proposing. Reading the whole file in memory is too memory-intensive for large files (one could presumably POST really huge files, CD images or whatever). What the current code does is: determine the file size, send Content-Length, read the file in chunks (up to the promised size) and send those chunks to the server. But that works only with regular files. It would be really nice to be able to say something like: mkisofs blabla | wget http://burner/localburn.cgi --post-file /dev/stdin My first impulse was to bemoan Wget's antiquated HTTP code which doesn't understand chunked transfer. But, coming to think of it, even if Wget used HTTP/1.1, I don't see how a client can send chunked requests and interoperate with HTTP/1.0 servers. How do browsers figure out whether they can do a chunked transfer or not? I haven't checked, but I'm 99% convinced that browsers simply don't give a shit about non-regular files.
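What the current code does, spelled out as a sketch: stat the file, promise exactly that many bytes in Content-Length, then stream the file in fixed-size blocks. This is an illustration in Python, not Wget's actual C code, and the host, path and block size are placeholders:

    # Sketch of the Content-Length approach used for --post-file: learn the
    # size up front, promise it, then stream the file in blocks.
    import os
    import socket

    def post_regular_file(host, port, path, filename, block_size=8192):
        size = os.path.getsize(filename)          # only meaningful for regular files
        sock = socket.create_connection((host, port))
        sock.sendall(("POST %s HTTP/1.0\r\n"
                      "Host: %s\r\n"
                      "Content-Length: %d\r\n"
                      "\r\n" % (path, host, size)).encode("ascii"))
        sent = 0
        with open(filename, "rb") as f:
            while sent < size:                     # never send more than promised
                block = f.read(min(block_size, size - sent))
                if not block:
                    break
                sock.sendall(block)
                sent += len(block)
        reply = sock.recv(4096)
        sock.close()
        return reply

The whole problem in this thread is the very first line of that function: stat() has nothing useful to say about the size of a pipe or of /dev/stdin.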
Re: Using chunked transfer for HTTP requests?
On Tuesday, 07.10.03, at 16:36 (Europe/Berlin), Hrvoje Niksic wrote: What the current code does is: determine the file size, send Content-Length, read the file in chunks (up to the promised size) and send those chunks to the server. But that works only with regular files. It would be really nice to be able to say something like: mkisofs blabla | wget http://burner/localburn.cgi --post-file /dev/stdin That would indeed be nice. Since I'm coming from the WebDAV side of life: does wget allow the use of PUT? My first impulse was to bemoan Wget's antiquated HTTP code, which doesn't understand chunked transfer. But, come to think of it, even if Wget used HTTP/1.1, I don't see how a client can send chunked requests and interoperate with HTTP/1.0 servers. How do browsers figure out whether they can do a chunked transfer or not? I haven't checked, but I'm 99% convinced that browsers simply don't give a shit about non-regular files. That's probably true. But have you tried sending without Content-Length and Connection: close and closing the output side of the socket before starting to read the reply from the server? //Stefan
Re: some wget patches against beta3
Karl Eichwalder [EMAIL PROTECTED] writes: I guess you, as the wget maintainer, switched from something supported to the unsupported betaX scheme, and now we have something to talk about ;) I had no idea that something as usual as betaX was unsupported. In fact, I believe that bX was added when Francois saw me using it in Wget. :-) Using something different than exactly wget-1.9-b3.de.po will confuse the robot, sigh. Returning an error that says "your version number is unparsable to this piece of software; you must use one of ..." instead would be more correct in the long run. Sure. You should have received a message like this, didn't you? I didn't. Maybe it was an artifact of the robot not having worked at the time, though.
Re: Using chunked transfer for HTTP requests?
Stefan Eissing [EMAIL PROTECTED] writes: On Tuesday, 07.10.03, at 16:36 (Europe/Berlin), Hrvoje Niksic wrote: What the current code does is: determine the file size, send Content-Length, read the file in chunks (up to the promised size) and send those chunks to the server. But that works only with regular files. It would be really nice to be able to say something like: mkisofs blabla | wget http://burner/localburn.cgi --post-file /dev/stdin That would indeed be nice. Since I'm coming from the WebDAV side of life: does wget allow the use of PUT? No. I haven't checked, but I'm 99% convinced that browsers simply don't give a shit about non-regular files. That's probably true. But have you tried sending without Content-Length and Connection: close and closing the output side of the socket before starting to read the reply from the server? That might work, but it sounds too dangerous to do by default, and too obscure to devote a command-line option to. Besides, HTTP/1.1 *requires* requests with a request-body to provide Content-Length: For compatibility with HTTP/1.0 applications, HTTP/1.1 requests containing a message-body MUST include a valid Content-Length header field unless the server is known to be HTTP/1.1 compliant.
Re: Using chunked transfer for HTTP requests?
On Tuesday, 07.10.03, at 17:02 (Europe/Berlin), Hrvoje Niksic wrote: That's probably true. But have you tried sending without Content-Length and Connection: close and closing the output side of the socket before starting to read the reply from the server? That might work, but it sounds too dangerous to do by default, and too obscure to devote a command-line option to. Besides, HTTP/1.1 *requires* requests with a request-body to provide Content-Length: For compatibility with HTTP/1.0 applications, HTTP/1.1 requests containing a message-body MUST include a valid Content-Length header field unless the server is known to be HTTP/1.1 compliant. I just checked with RFC 1945 and it explicitly says that POSTs must carry a valid Content-Length header. That leaves the option of first sending an OPTIONS request to the server (either the URL or *) to check the HTTP version. //Stefan
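Stefan's OPTIONS idea boils down to issuing one cheap request first and reading the server's HTTP version out of the status line of the reply. A rough sketch (placeholder host; no timeouts, and plenty of servers and proxies handle OPTIONS * poorly, so this is only the shape of the idea):

    # Sketch: probe a server's HTTP version with OPTIONS before deciding
    # whether a chunked request body is safe to send.
    import socket

    def http_version(host, port=80):
        sock = socket.create_connection((host, port))
        sock.sendall(("OPTIONS * HTTP/1.1\r\n"
                      "Host: %s\r\n\r\n" % host).encode("ascii"))
        status_line = sock.recv(1024).split(b"\r\n", 1)[0]   # e.g. b"HTTP/1.1 200 OK"
        sock.close()
        return status_line.split(b" ", 1)[0]                 # b"HTTP/1.0" or b"HTTP/1.1"

    # if http_version("burner") == b"HTTP/1.1": a chunked --post-file could be attempted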
[PATCH] wget-1.8.2: Portability, plus EBCDIC patch
Hello Hrvoje and Dan, I have been using wget for many years now, and finally got around to applying a patch I made long ago (an EBCDIC patch against wget-1.5.3) to the current wget-1.8.2. This patch makes wget compile and run on a mainframe computer using the EBCDIC character set. Also, when compiling wget on Solaris (using the SUNWspro Forte compiler), I stumbled over a portability problem (C++ comments in a C source) for which I add a patch as well.

About the EBCDIC patch:

* The goal was to create a patch which worked for our EBCDIC system (Fujitsu-Siemens' mainframe OS is called BS2000; it runs on /390 hardware, but is not compatible with OS/390 per se) but would be easily adaptable to OS/390 (to which I have no access, but whose behaviour I know from similar ports). The code to actually make it work for OS/390 is not in place, but I added a tool (called safe-ctype-mk.c -- delete it if you don't like it) to create the additions to safe-ctype.c which are necessary because IBM's EBCDIC differs from our EBCDIC.

* Because code conversion is necessary for text files, a distinction between text and binary download was added (based on the downloaded MIME type; see the routines http_set_convert_flag() and http_get_convert_flag()). A future patch may add a new --conversion=text/binary/auto switch, which is not implemented yet. Currently, the same heuristics are used as in the Apache HTTP server to determine whether conversion is required (for several kinds of text files) or not required (for images, compressed files etc.)

* Because EBCDIC alphabetic characters live in the range between '\xA1' and '\xE9', the getopt_long() numbers have been shifted up by 200, beyond the 0xFF boundary, to avoid conflicts between single-character options and numeric long-option values. That does not change the behaviour on ASCII machines, but allows the source to compile on EBCDIC machines (otherwise: error: multiple case in switch).

* wget-1.8.2 has been compiled on our BS2000, with the patch applied and with SSL enabled (against openssl-0.9.6k), and has been tested to work correctly. If you would add the patch to future versions of wget, then all users of our BS2000 as well as users of IBM's OS/390 could take advantage of the availability of wget for EBCDIC-based machines, and hopefully someone would also contribute the missing IBM-EBCDIC counterparts to our BS2000-EBCDIC patch.
Martin
--
[EMAIL PROTECTED] | Fujitsu Siemens
Fon: +49-89-636-46021, FAX: +49-89-636-47655 | 81730 Munich, Germany

diff -bur wget-1.8.2/src/ftp.c work/wget-1.8.2/src/ftp.c
--- wget-1.8.2/src/ftp.c.orig   2003-10-06 17:20:58.710178000 +0200
+++ wget-1.8.2/src/ftp.c        2003-10-06 17:17:00.399371000 +0200
@@ -474,7 +474,7 @@
     }
   err = ftp_size(con->rbuf, u->file, len);
-// printf("\ndebug: %lld\n", *len);
+/* printf("\ndebug: %lld\n", *len); */
   /* FTPRERR */
   switch (err)
     {
diff -bur wget-1.8.2/src/http.c work/wget-1.8.2/src/http.c
--- wget-1.8.2/src/http.c.orig  2003-10-06 17:20:58.900182000 +0200
+++ wget-1.8.2/src/http.c       2003-10-06 17:19:16.829836000 +0200
@@ -1777,7 +1777,7 @@
       FREE_MAYBE (dummy);
       return RETROK;
     }
-// fprintf(stderr, "test: hstat.len: %lld, hstat.restval: %lld\n", hstat.dltime);
+/* fprintf(stderr, "test: hstat.len: %lld, hstat.restval: %lld\n", hstat.dltime); */
   tmrate = retr_rate (hstat.len - hstat.restval, hstat.dltime, 0);
   if (hstat.len == hstat.contlen)
diff -bur wget-1.8.2.orig/src/connect.c wget-1.8.2/src/connect.c
--- wget-1.8.2.orig/src/connect.c       Mon Oct 6 17:13:11 2003
+++ wget-1.8.2/src/connect.c    Mon Oct 6 17:10:28 2003
@@ -47,6 +47,10 @@
 #endif
 #endif /* WINDOWS */
 
+#if #system(bs2000)
+#include <ascii_ebcdic.h>
+#endif
+
 #include <errno.h>
 #ifdef HAVE_STRING_H
 # include <string.h>
@@ -73,6 +77,26 @@
    to connect_to_one.  */
 static const char *connection_host_name;
 
+#if 'A' == '\xC1' /* CHARSET_EBCDIC */
+/* Start off with convert=1 (headers are always converted) */
+static int convert_flag_last_reply = 1;
+
+void
+http_set_convert_flag(const char *type)
+{
+  convert_flag_last_reply =
+    (strncasecmp(type, "text/", 5) == 0
+     || strncasecmp(type, "message/", 8) == 0
+     || strcasecmp(type, "application/postscript") == 0);
+}
+
+int
+http_get_convert_flag()
+{
+  return convert_flag_last_reply;
+}
+#endif
+
 void
 set_connection_host_name (const char *host)
 {
@@ -459,6 +483,11 @@
     }
   while (res == -1 && errno == EINTR);
 
+#if 'A' == '\xC1'
+  if (res > 0 && http_get_convert_flag())
+    _a2e_n(buf, res);
+#endif
+
   return res;
 }
@@ -472,6 +501,25 @@
 {
   int res = 0;
 
+#if 'A' == '\xC1' /* CHARSET_EBCDIC */
+  static char *cbuf = NULL;
+  static int csize = 0;
+
+  if (len > csize) {
+    if (cbuf != NULL)
+      free(cbuf);
+    cbuf = malloc(csize = len+8192); /* add arbitrary amount of skew */
+    if
Re: [PATCH] wget-1.8.2: Portability, plus EBCDIC patch
Martin, thanks for the patch and the detailed report. Note that it might have made more sense to apply the patch to the latest CVS version, which is somewhat different from 1.8.2. I'm really not sure whether to add this patch. On the one hand, it's nice to support as many architectures as possible. But on the other hand, most systems are ASCII. All the systems I've ever seen or worked on have been ASCII. I am fairly certain that I would not be able to support EBCDIC in the long run and that, unless someone were to continually support EBCDIC, the existing support would bitrot away. Is anyone on the Wget list using an EBCDIC system?
Major, and seemingly random problems with wget 1.8.2
Hello,

I have noticed very unpredictable behavior from wget 1.8.2 - specifically I have noticed two things:

a) sometimes it does not follow all of the links it should
b) sometimes wget will follow links to other sites and URLs - when the command line used should not allow it to do that.

Here are the details. First, sometimes when you attempt to download a site with -k -m (--convert-links and --mirror) wget will not follow all of the links and will skip some of the files! I have no idea why it does this with some sites and doesn't do it with other sites. Here is an example that I have reproduced on several systems - all with 1.8.2:

# wget -k -m http://www.zorg.org/vsound/
--17:09:32--  http://www.zorg.org/vsound/
           => `www.zorg.org/vsound/index.html'
Resolving www.zorg.org... done.
Connecting to www.zorg.org[213.232.100.31]:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]

    [ <=> ]  12,235     53.82K/s

Last-modified header missing -- time-stamps turned off.
17:09:32 (53.82 KB/s) - `www.zorg.org/vsound/index.html' saved [12235]

FINISHED --17:09:32--
Downloaded: 12,235 bytes in 1 files
Converting www.zorg.org/vsound/index.html... 2-6
Converted 1 files in 0.03 seconds.

What is the problem here? When I run the exact same command line with wget 1.6, I get this:

# wget -k -m http://www.zorg.org/vsound/
--11:10:06--  http://www.zorg.org/vsound/
           => `www.zorg.org/vsound/index.html'
Connecting to www.zorg.org:80... connected!
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]

    0K -> .. .

Last-modified header missing -- time-stamps turned off.
11:10:07 (71.12 KB/s) - `www.zorg.org/vsound/index.html' saved [12235]

Loading robots.txt; please ignore errors.
--11:10:07--  http://www.zorg.org/robots.txt
           => `www.zorg.org/robots.txt'
Connecting to www.zorg.org:80... connected!
HTTP request sent, awaiting response... 404 Not Found
11:10:07 ERROR 404: Not Found.

--11:10:07--  http://www.zorg.org/vsound/vsound.jpg
           => `www.zorg.org/vsound/vsound.jpg'
Connecting to www.zorg.org:80... connected!
HTTP request sent, awaiting response... 200 OK
Length: 27,629 [image/jpeg]

    0K -> .. .. ..                                               [100%]

11:10:08 (51.49 KB/s) - `www.zorg.org/vsound/vsound.jpg' saved [27629/27629]

--11:10:09--  http://www.zorg.org/vsound/vsound-0.2.tar.gz
           => `www.zorg.org/vsound/vsound-0.2.tar.gz'
Connecting to www.zorg.org:80... connected!
HTTP request sent, awaiting response... 200 OK
Length: 108,987 [application/x-tar]

    0K -> .. .. .. .. ..                                         [ 46%]
   50K -> .. .. .. .. ..                                         [ 93%]
  100K -> ..                                                     [100%]

11:10:12 (46.60 KB/s) - `www.zorg.org/vsound/vsound-0.2.tar.gz' saved [108987/108987]

--11:10:12--  http://www.zorg.org/vsound/vsound-0.5.tar.gz
           => `www.zorg.org/vsound/vsound-0.5.tar.gz'
Connecting to www.zorg.org:80... connected!
HTTP request sent, awaiting response... 200 OK
Length: 116,904 [application/x-tar]

    0K -> .. .. .. .. ..                                         [ 43%]
   50K -> .. .. .. .. ..                                         [ 87%]
  100K -> ..                                                     [100%]

11:10:14 (60.44 KB/s) - `www.zorg.org/vsound/vsound-0.5.tar.gz' saved [116904/116904]

--11:10:14--  http://www.zorg.org/vsound/vsound
           => `www.zorg.org/vsound/vsound'
Connecting to www.zorg.org:80... connected!
HTTP request sent, awaiting response... 200 OK
Length: 3,365 [text/plain]

    0K -> ...                                                    [100%]

11:10:14 (3.21 MB/s) - `www.zorg.org/vsound/vsound' saved [3365/3365]

Converting www.zorg.org/vsound/index.html... done.
FINISHED --11:10:14--
Downloaded: 269,120 bytes in 5 files
Converting www.zorg.org/vsound/index.html... done.

See?
It gets the links inside of index.html, mirrors those links, and converts them - just like it should. Why does 1.8.2 have a problem with this site? Other sites are handled just fine by 1.8.2 with the same command line ... it makes no sense that wget 1.8.2 has problems with particular web sites. This is incorrect behavior - and if you try the same URL with 1.8.2 you can reproduce the same results. The second problem, and I cannot currently give you an example to try yourself but _it does happen_, is if you use this command line:

wget --tries=inf -nH --no-parent --directory-prefix=/usr/data/www.explodingdog.com --random-wait -r -l inf --convert-links --html-extension --user-agent="Mozilla/4.0 (compatible; MSIE 6.0; AOL 7.0; Windows NT 5.1)" www.example.com

At first it will act normally, just going over the site in question, but sometimes you will come back to the terminal and see it grabbing all sorts of pages from totally different sites (!) I have seen this happen
Re: Using chunked transfer for HTTP requests?
Hrvoje Niksic wrote: I don't understand what you're proposing. Reading the whole file in memory is too memory-intensive for large files (one could presumably POST really huge files, CD images or whatever). I was proposing that you read the file to determine the length, but that was on the assumption that you could read the input twice, which won't work with the example you proposed. It would be really nice to be able to say something like: mkisofs blabla | wget http://burner/localburn.cgi --post-file /dev/stdin Stefan Eissing wrote: I just checked with RFC 1945 and it explicitly says that POSTs must carry a valid Content-Length header. In that case, Hrvoje will need to get creative. :-) Can you determine if --post-file is a regular file? If so, I still think you should just read (or otherwise examine) the file to determine the length. For other types of input, perhaps you want to write the input to a temporary file. Tony
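Tony's fallback - spool the stream to a temporary file just to learn its length, then POST it the normal way - is easy to sketch, although, as the reply below points out, it defeats streaming for something the size of a CD image. A hypothetical helper (not proposed Wget code; post_regular_file() is the sketch shown earlier in the thread):

    # Sketch of the "spool to a temp file first" fallback for non-seekable
    # input: copy the stream to disk to learn its size, then POST normally.
    import shutil
    import sys
    import tempfile

    def spool_to_tempfile(stream, block_size=8192):
        tmp = tempfile.NamedTemporaryFile(prefix="wget-post-", delete=False)
        shutil.copyfileobj(stream, tmp, block_size)   # may need gigabytes of disk
        tmp.close()
        return tmp.name                               # now a regular, stat-able file

    # e.g.: post_regular_file("burner", 80, "/localburn.cgi",
    #                         spool_to_tempfile(sys.stdin.buffer))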
Re: Using chunked transfer for HTTP requests?
Tony Lewis [EMAIL PROTECTED] writes: Hrvoje Niksic wrote: I don't understand what you're proposing. Reading the whole file in memory is too memory-intensive for large files (one could presumably POST really huge files, CD images or whatever). I was proposing that you read the file to determine the length, but that was on the assumption that you could read the input twice, which won't work with the example you proposed. In fact, it won't work with anything except regular files and links to them. Can you determine if --post-file is a regular file? Yes. If so, I still think you should just read (or otherwise examine) the file to determine the length. That's how --post-file works now. The problem is that it doesn't work for non-regular files. My first message explains it, or at least tries to. For other types of input, perhaps you want write the input to a temporary file. That would work for short streaming, but would be pretty bad in the mkisofs example. One would expect Wget to be able to stream the data to the server, and that's just not possible if the size needs to be known in advance, which HTTP/1.0 requires.
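The "is it a regular file?" test mentioned above is a one-line stat check; in C it would be S_ISREG on the st_mode returned by stat(), and the same idea looks like this in the sketch language used here:

    # Sketch: decide whether a --post-file style argument can be stat()ed for
    # its size.  S_ISREG is the standard test on st_mode.
    import os
    import stat

    def is_regular_file(path):
        return stat.S_ISREG(os.stat(path).st_mode)

    # is_regular_file("/tmp/image.iso")  -> True
    # is_regular_file("/dev/stdin")      -> usually False (a FIFO or a tty)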
Re: Major, and seemingly random problems with wget 1.8.2
Josh Brooks [EMAIL PROTECTED] writes: I have noticed very unpredictable behavior from wget 1.8.2 - specifically I have noticed two things: a) sometimes it does not follow all of the links it should b) sometimes wget will follow links to other sites and URLs - when the command line used should not allow it to do that. Thanks for the report. A more detailed response follows below.

First, sometimes when you attempt to download a site with -k -m (--convert-links and --mirror) wget will not follow all of the links and will skip some of the files! I have no idea why it does this with some sites and doesn't do it with other sites. Here is an example that I have reproduced on several systems - all with 1.8.2:

Links are missed on some sites because of the use of incorrect comments. This has been fixed for Wget 1.9, where more relaxed comment parsing is the default. But that's not the case for www.zorg.org/vsound/. www.zorg.org/vsound/ contains this markup:

<META NAME="ROBOTS" CONTENT="NOFOLLOW">

That explicitly tells robots, such as Wget, not to follow the links in the page. Wget respects this and does not follow the links. You can tell Wget to ignore the robot directives. For me, this works as expected:

wget -km -e robots=off http://www.zorg.org/vsound/

You can put `robots=off' in your .wgetrc and this problem will not bother you again.

The second problem, and I cannot currently give you an example to try yourself but _it does happen_, is if you use this command line: wget --tries=inf -nH --no-parent --directory-prefix=/usr/data/www.explodingdog.com --random-wait -r -l inf --convert-links --html-extension --user-agent="Mozilla/4.0 (compatible; MSIE 6.0; AOL 7.0; Windows NT 5.1)" www.example.com At first it will act normally, just going over the site in question, but sometimes you will come back to the terminal and see it grabbing all sorts of pages from totally different sites (!)

The only way I've seen it happen is when it follows a redirection to a different site. The redirection is followed because it's considered to be part of the same download. However, further links on the redirected site are not (supposed to be) followed. If you have a repeatable example, please mail it here so we can examine it in more detail.
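For anyone who wants to check a page for this directive before mirroring it, the tag Wget honors is easy to sniff out of the HTML. A rough illustration (not Wget's own parser):

    # Rough check for a <META NAME="ROBOTS" CONTENT="...nofollow..."> tag, the
    # directive that made Wget skip the links above.  Illustration only.
    from html.parser import HTMLParser

    class RobotsMetaSniffer(HTMLParser):
        def __init__(self):
            super().__init__()
            self.nofollow = False

        def handle_starttag(self, tag, attrs):
            attrs = dict(attrs)
            if (tag == "meta"
                    and (attrs.get("name") or "").lower() == "robots"
                    and "nofollow" in (attrs.get("content") or "").lower()):
                self.nofollow = True

    def page_forbids_following(html_text):
        sniffer = RobotsMetaSniffer()
        sniffer.feed(html_text)
        return sniffer.nofollow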
Re: Major, and seemingly random problems with wget 1.8.2
Thank you for the great response. It is much appreciated - see below... On Tue, 7 Oct 2003, Hrvoje Niksic wrote: www.zorg.org/vsound/ contains this markup: <META NAME="ROBOTS" CONTENT="NOFOLLOW"> That explicitly tells robots, such as Wget, not to follow the links in the page. Wget respects this and does not follow the links. You can tell Wget to ignore the robot directives. For me, this works as expected: wget -km -e robots=off http://www.zorg.org/vsound/ Perfect - thank you. At first it will act normally, just going over the site in question, but sometimes you will come back to the terminal and see it grabbing all sorts of pages from totally different sites (!) The only way I've seen it happen is when it follows a redirection to a different site. The redirection is followed because it's considered to be part of the same download. However, further links on the redirected site are not (supposed to be) followed. Ok, is there a way to tell wget not to follow redirects, so it will never do that at all? Basically I am looking for a way to tell wget "don't ever get anything with a different FQDN than what I started you with". thanks.
Re: Using chunked transfer for HTTP requests?
Hrvoje Niksic wrote: That would work for short streaming, but would be pretty bad in the mkisofs example. One would expect Wget to be able to stream the data to the server, and that's just not possible if the size needs to be known in advance, which HTTP/1.0 requires. One might expect it, but if it's not possible using the HTTP protocol, what can you do? :-)
Re: Web page source using wget?
Thanks everyone for the replies so far. The problem I am having is that the customer is using ASP and JavaScript. The URL stays the same as I click through the links. So, using `wget URL' for the page I want may not work (I may be wrong). Any suggestions on how I can tackle this? Thanks, Suhas - Original Message - From: Hrvoje Niksic [EMAIL PROTECTED] To: Suhas Tembe [EMAIL PROTECTED] Cc: [EMAIL PROTECTED] Sent: Monday, October 06, 2003 5:19 PM Subject: Re: Web page source using wget? Suhas Tembe [EMAIL PROTECTED] writes: Hello Everyone, I am new to this wget utility, so pardon my ignorance.. Here is a brief explanation of what I am currently doing: 1). I go to our customer's website every day and log in using a user name and password. 2). I click on 3 links before I get to the page I want. 3). I right-click on the page and choose "view source". It opens it up in Notepad. 4). I save the source to a file and subsequently perform various tasks on that file. As you can see, it is a manual process. What I would like to do is automate this process of obtaining the source of a page using wget. Is this possible? Maybe you can give me some suggestions. It's possible, in fact it's what Wget does in its most basic form. Disregarding authentication, the recipe would be: 1) Write down the URL. 2) Type `wget URL' and you get the source of the page in a file named SOMETHING.html, where SOMETHING is the file name that the URL ends with. Of course, you will also have to specify the credentials to the page, and Tony explained how to do that.
Re: Web page source using wget?
Suhas Tembe [EMAIL PROTECTED] writes: Thanks everyone for the replies so far. The problem I am having is that the customer is using ASP and JavaScript. The URL stays the same as I click through the links. The URL staying the same is usually a sign of the use of frames, not of ASP and JavaScript. Instead of looking at the URL entry field, try using "copy link to clipboard" on the last link rather than clicking on it. Then use Wget on that.
Re: Web page source using wget?
Got it! Thanks! So far so good. After logging in, I was able to get to the page I am interested in. There was one thing that I forgot to mention in my earlier posts (I apologize)... this page contains a drop-down list of our customer's locations. At present, I choose one location from the drop-down list and click submit to get the data, which is displayed in a report format. I right-click, then choose "view source" and save the source to a file. I then choose the next location from the drop-down list and click submit again. I again do a "view source" and save the source to another file, and so on for all their locations. I am not quite sure how to automate this process! How can I do this non-interactively, especially the "submit" portion of the page? Is this possible using wget? Thanks, Suhas - Original Message - From: Hrvoje Niksic [EMAIL PROTECTED] To: Suhas Tembe [EMAIL PROTECTED] Cc: [EMAIL PROTECTED] Sent: Tuesday, October 07, 2003 5:02 PM Subject: Re: Web page source using wget? Suhas Tembe [EMAIL PROTECTED] writes: Thanks everyone for the replies so far. The problem I am having is that the customer is using ASP and JavaScript. The URL stays the same as I click through the links. The URL staying the same is usually a sign of the use of frames, not of ASP and JavaScript. Instead of looking at the URL entry field, try using "copy link to clipboard" on the last link rather than clicking on it. Then use Wget on that.
Re: Web page source using wget?
Suhas Tembe [EMAIL PROTECTED] writes: this page contains a drop-down list of our customer's locations. At present, I choose one location from the drop-down list and click submit to get the data, which is displayed in a report format. I right-click, then choose "view source" and save the source to a file. I then choose the next location from the drop-down list and click submit again. I again do a "view source" and save the source to another file, and so on for all their locations. It's possible to automate this, but it requires some knowledge of HTML. Basically, you need to look at the <form>...</form> part of the page and find the <select> tag that defines the drop-down. Assuming that the form looks like this:

<form action="http://foo.com/customer" method="GET">
  <select name="location">
    <option value="ca">California
    <option value="ma">Massachusetts
    ...
  </select>
</form>

you'd automate getting the locations by doing something like:

for loc in ca ma ...
do
  wget "http://foo.com/customer?location=$loc"
done

Wget will save the respective sources in files named customer?location=ca, customer?location=ma, etc. But this was only an example. The actual process depends on what's in the form, and it might be considerably more complex than this.
Re: Web page source using wget?
It does look a little complicated. This is how it looks:

<form action="InventoryStatus.asp" method="post" name="select" onsubmit="return select_validate();" style="margin:0">
<div style="margin-top:10px">
<table border="1" bordercolor="#d9d9d9" bordercolordark="#ff" bordercolorlight="#d9d9d9" cellpadding="3" cellspacing="0" width="100%">
<tr>
<td style="font-weight:bold;color:black;background-color:#CC;text-align:right" width="20%"><nobr>Supplier&nbsp;</nobr></td>
<td style="color:black;background-color:#F0;text-align:left" colspan="2"><nobr><select name="cboSupplier"><option value="4541-134289">454A</option>
<option value="4542-134289" selected>454B</option></select> <img id="cboSupplier_icon" name="cboSupplier_icon" src="../images/required.gif" alt="*"></nobr></td>
</tr>
<tr>
<td style="font-weight:bold;color:black;background-color:#CC;text-align:right" width="20%"><nobr>Quantity Status&nbsp;</nobr></td>
<td style="color:black;background-color:#F0;text-align:left" colspan="2">
<table border="0" cellpadding="0" cellspacing="0">
<tr>
<td>
<table border="0">
<tr>
<td width="1"><input id="choice_IDAMCB3B" name="status" type="radio" value="over"></td>
<td style="color:black;background-color:#F0;text-align:left"><span onclick="choice_IDAMCB3B.checked=true;"> Over</span></td>
<td width="1"><input id="choice_IDARCB3B" name="status" type="radio" value="under"></td>
<td style="color:black;background-color:#F0;text-align:left"><span onclick="choice_IDARCB3B.checked=true;"> Under</span></td>
<td width="1"><input id="choice_IDAWCB3B" name="status" type="radio" value="both"></td>
<td style="color:black;background-color:#F0;text-align:left"><span onclick="choice_IDAWCB3B.checked=true;"> Both</span></td>
<td width="1"><input id="choice_IDA1CB3B" name="status" type="radio" value="all" checked></td>
<td style="color:black;background-color:#F0;text-align:left"><span onclick="choice_IDA1CB3B.checked=true;"> All</span></td>
</tr>
</table>
</td>
<td><img id="status_icon" name="status_icon" src="../images/blank.gif" alt=""></td>
</tr>
</table>
</td>
</tr>
<tr>
<td style="font-weight:bold;color:black;background-color:#CC">&nbsp;</td>
<td colspan="2" style="font-weight:bold;color:black;background-color:#CC;text-align:left"><input type="submit" name="action-select" value="Query" onclick="doValidate = true;"></td>
</tr>
</table>
</div>
</form>

I don't see any specific URL that would get the relevant data after I hit submit. Maybe I am missing something... Thanks, Suhas - Original Message - From: Hrvoje Niksic [EMAIL PROTECTED] To: Suhas Tembe [EMAIL PROTECTED] Cc: [EMAIL PROTECTED] Sent: Tuesday, October 07, 2003 5:24 PM Subject: Re: Web page source using wget? Suhas Tembe [EMAIL PROTECTED] writes: this page contains a drop-down list of our customer's locations. At present, I choose one location from the drop-down list and click submit to get the data, which is displayed in a report format. I right-click, then choose "view source" and save the source to a file. I then choose the next location from the drop-down list and click submit again. I again do a "view source" and save the source to another file, and so on for all their locations. It's possible to automate this, but it requires some knowledge of HTML. Basically, you need to look at the <form>...</form> part of the page and find the <select> tag that defines the drop-down. Assuming that the form looks like this:

<form action="http://foo.com/customer" method="GET">
  <select name="location">
    <option value="ca">California
    <option value="ma">Massachusetts
    ...
  </select>
</form>

you'd automate getting the locations by doing something like:

for loc in ca ma ...
do
  wget "http://foo.com/customer?location=$loc"
done

Wget will save the respective sources in files named customer?location=ca, customer?location=ma, etc. But this was only an example.
The actual process depends on what's in the form, and it might be considerably more complex than this.
Re: Web page source using wget?
Suhas Tembe [EMAIL PROTECTED] writes: It does look a little complicated. This is how it looks:

<form action="InventoryStatus.asp" method="post">
[...]
[...]
<select name="cboSupplier">
<option value="4541-134289">454A</option>
<option value="4542-134289" selected>454B</option>
</select>

Those are the important parts. It's not hard to submit this form. With Wget 1.9, you can even use the POST method, e.g.:

wget http://.../InventoryStatus.asp --post-data \
    'cboSupplier=4541-134289&status=all&action-select=Query' \
    -O InventoryStatus1.asp
wget http://.../InventoryStatus.asp --post-data \
    'cboSupplier=4542-134289&status=all&action-select=Query' -O InventoryStatus2.asp

It might even work to simply use GET, and retrieve

http://.../InventoryStatus.asp?cboSupplier=4541-134289&status=all&action-select=Query

without the need for `--post-data' or `-O', but that depends on the ASP script that does the processing. The harder part is to automate this process for *any* values in the drop-down list. You might need to use an intermediary Perl script that extracts all the <option value=...> from the HTML source of the page with the drop-down. Then, from the output of the Perl script, you call Wget as shown above. It's doable, but it takes some work. Unfortunately, I don't know of a (command-line) tool that would make this easier.
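Hrvoje suggests an intermediary Perl script; the same glue can be sketched in Python: scrape the option values out of a saved copy of the page, then run wget once per supplier. The URL and file names below are placeholders, and the field names are taken from the form source Suhas posted:

    # Sketch of the "intermediary script" idea: pull every <option value="...">
    # out of a saved copy of the page (the only drop-down there is cboSupplier),
    # then invoke wget with --post-data once per value.
    import re
    import subprocess

    def supplier_values(html_text):
        # crude scrape; good enough for a page as regular as the one above
        return re.findall(r'<option\s+value="([^"]+)"', html_text)

    def fetch_reports(page_file, url):
        with open(page_file, encoding="latin-1") as f:
            values = supplier_values(f.read())
        for i, value in enumerate(values, 1):
            subprocess.run(
                ["wget", url,
                 "--post-data",
                 "cboSupplier=%s&status=all&action-select=Query" % value,
                 "-O", "InventoryStatus%d.asp" % i],
                check=True)

    # fetch_reports("InventoryStatus.asp", "http://customer.example/InventoryStatus.asp")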