Re: wget does not parse .netrc properly
Holger Klawitter [EMAIL PROTECTED] writes:

> I am using wget 1.5.3 under Linux (SuSE 7.1) and I discovered that
> wget fails to parse netrc files if some words contain whitespace.

Wget 1.5.3 is old. I've now tried putting a quoted password in my
`.netrc', and it works for me with the latest version.
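For reference, whitespace inside a `.netrc' token has to be quoted. A minimal example (the machine name and credentials here are made up):

```
machine ftp.example.com
login holger
password "secret with spaces"
```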
Wget 1.8+CVS not passing referer for recursive retrieval
Although retrieve_tree() stores and retrieves referring URLs in the URL
queue, it does not pass them to retrieve_url(). This seems to have got
lost during the transition from depth-first to breadth-first retrieval.

This means that HTTP requests for URLs being retrieved at depth greater
than 0 have the Referer set to that set by the --referer option or
nothing at all, and not necessarily the URL of the referring page.

src/ChangeLog entry:

2001-12-18  Ian Abbott  [EMAIL PROTECTED]

	* recur.c (retrieve_tree): Pass on referring URL when retrieving
	recursed URL.

Index: src/recur.c
===================================================================
RCS file: /pack/anoncvs/wget/src/recur.c,v
retrieving revision 1.37
diff -u -r1.37 recur.c
--- src/recur.c	2001/12/13 19:18:31	1.37
+++ src/recur.c	2001/12/18 13:28:58
@@ -237,7 +237,7 @@
 	  int oldrec = opt.recursive;
 	  opt.recursive = 0;
-	  status = retrieve_url (url, file, redirected, NULL, dt);
+	  status = retrieve_url (url, file, redirected, referer, dt);
 	  opt.recursive = oldrec;
 	  if (file && status == RETROK
RE: parameters in the URL
More probably the usual (sigh) file system unusable character problem
('?' in this case). I can't do it now, but try something like

  wget -O out.html "http://www.baxleys.org/noah/other_images.php?month_id=11&year_id=2001"

(mind the wrap) to see if you can download that URL directly without
going through test2.html. If the page is saved correctly in out.html,
the problem is indeed the file system problem.

	Heiko

-- 
-- PREVINET S.p.A.            [EMAIL PROTECTED]
-- Via Ferretto, 1            ph  x39-041-5907073
-- I-31021 Mogliano V.to (TV) fax x39-041-5907087
-- ITALY

-----Original Message-----
From: Alan Eldridge [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, December 18, 2001 3:05 PM
To: [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Subject: Re: parameters in the URL

On Tue, Dec 18, 2001 at 05:55:17AM -0800, Nate Baxley wrote:
> Okay, it looks like it is finding the page and getting a 200 back
> from HTTP, but then can't find the page.

I don't think that's what it's saying at all. Read on...

> --22:41:11-- http://www.baxleys.org/noah/other_images.php?month_id=11&year_id=2001
>            => `www.baxleys.org/noah/other_images.php?month_id=11&year_id=2001'
> Connecting to www.baxleys.org[66.78.8.167]:80... connected.
> HTTP request sent, awaiting response... 200 OK
> Length: unspecified [text/html]
> www.baxleys.org/noah/other_images.php?month_id=11&year_id=2001: No such file or directory
> Cannot write to `www.baxleys.org/noah/other_images.php?month_id=11&year_id=2001' (No such file or directory).

That's a local error. The (f)open on the output failed with errno set
to ENOENT. This could still be a wget problem, if somehow it got
confused and managed to miss out a mkdir call. Guess the next step is
to run it with debugging compiled in and turned on.

-- 
Alan Eldridge
Just another $THING hacker.
Re: Wget 1.8+CVS not passing referer for recursive retrieval
Ian Abbott [EMAIL PROTECTED] writes:

> Although retrieve_tree() stores and retrieves referring URLs in the
> URL queue, it does not pass them to retrieve_url(). This seems to
> have got lost during the transition from depth-first to
> breadth-first retrieval.

It was an oversight on my part. Pretty funny, too -- I did all the
work to retain the referrers, except actually *using* them. Thanks for
the patch; I've now applied it to CVS.
Wget 1.8.1-pre2 Problem with -i, -r and -l
I don't have time to look at this problem today, but I thought I'd
mention it now to defer the 1.8.1 release.

If I have a website http://somesite/ with three files on it:
index.html, a.html and b.html, such that index.html links only to
a.html and a.html links only to b.html, then the following command
will retrieve all three files:

  wget -r -l 1 http://somesite/index.html http://somesite/a.html

However, if I then create a file 'list' containing the lines:

  http://somesite/index.html
  http://somesite/a.html

and issue the command:

  wget -r -l 1 -i list

then only index.html and a.html are retrieved. I think wget should
also retrieve b.html, which is linked to by a.html, i.e. treat the
URLs in the file as though they were specified on the command line.
Re: Wget 1.8.1-pre2 Problem with -i, -r and -l
Ian Abbott [EMAIL PROTECTED] writes:

> If I have a website http://somesite/ with three files on it:
> index.html, a.html and b.html, such that index.html links only to
> a.html and a.html links only to b.html, then the following command
> will retrieve all three files:
>
>   wget -r -l 1 http://somesite/index.html http://somesite/a.html

Does it? For me this command retrieves only `index.html' and
`a.html', and that's a bug. `-i list' makes no difference. For me,
this patch fixes the bug in both cases:

2001-12-18  Hrvoje Niksic  [EMAIL PROTECTED]

	* recur.c (register_html): Maintain a hash table of HTML files
	along with the list.  Disallow duplicates.
	(retrieve_tree): Use downloaded_html_set to check whether the file
	found in dl_url_file_map is an HTML file, and descend into it if
	so.
	(convert_all_links): Don't guard against duplicates in
	downloaded_html_list, since they are no longer possible.

Index: src/recur.c
===================================================================
RCS file: /pack/anoncvs/wget/src/recur.c,v
retrieving revision 1.38
diff -u -r1.38 recur.c
--- src/recur.c	2001/12/18 15:22:03	1.38
+++ src/recur.c	2001/12/18 22:10:56
@@ -53,11 +53,12 @@
 static struct hash_table *dl_file_url_map;
 static struct hash_table *dl_url_file_map;
 
-/* List of HTML files downloaded in this Wget run.  Used for link
-   conversion after Wget is done.  This list should only be traversed
-   in order.  If you need to check whether a file has been downloaded,
-   use a hash table, e.g. dl_file_url_map.  */
-static slist *downloaded_html_files;
+/* List of HTML files downloaded in this Wget run, used for link
+   conversion after Wget is done.  The list and the set contain the
+   same information, except the list maintains the order.  Perhaps I
+   should get rid of the list, it's there for historical reasons.  */
+static slist *downloaded_html_list;
+static struct hash_table *downloaded_html_set;
 
 static void register_delete_file PARAMS ((const char *));
 
@@ -227,8 +228,18 @@
 	     the second time.  */
 	  if (dl_url_file_map && hash_table_contains (dl_url_file_map, url))
 	    {
+	      file = hash_table_get (dl_url_file_map, url);
+
 	      DEBUGP (("Already downloaded \"%s\", reusing it from \"%s\".\n",
-		       url, (char *)hash_table_get (dl_url_file_map, url)));
+		       url, file));
+
+	      /* This check might be horribly slow when downloading
+		 sites with a huge number of HTML docs.  Use a hash table
+		 instead!  Thankfully, it gets tripped only when you use
+		 `wget -r URL1 URL2 ...', as explained above.  */
+
+	      if (string_set_contains (downloaded_html_set, file))
+		descend = 1;
 	    }
 	  else
 	    {
@@ -815,9 +826,16 @@
 void
 register_html (const char *url, const char *file)
 {
-  if (!opt.convert_links)
+  if (!downloaded_html_set)
+    downloaded_html_set = make_string_hash_table (0);
+  else if (hash_table_contains (downloaded_html_set, file))
     return;
-  downloaded_html_files = slist_prepend (downloaded_html_files, file);
+
+  /* The set and the list should use the same copy of FILE, but the
+     slist interface insists on strduping the string it gets.  Oh
+     well.  */
+  string_set_add (downloaded_html_set, file);
+  downloaded_html_list = slist_prepend (downloaded_html_list, file);
 }
 
 /* This function is called when the retrieval is done to convert the
@@ -843,23 +861,17 @@
   int file_count = 0;
 
   struct wget_timer *timer = wtimer_new ();
-  struct hash_table *seen = make_string_hash_table (0);
 
   /* Destructively reverse downloaded_html_files to get it in the
     right order.  recursive_retrieve() used slist_prepend()
     consistently.  */
-  downloaded_html_files = slist_nreverse (downloaded_html_files);
+  downloaded_html_list = slist_nreverse (downloaded_html_list);
 
-  for (html = downloaded_html_files; html; html = html->next)
+  for (html = downloaded_html_list; html; html = html->next)
     {
       struct urlpos *urls, *cur_url;
       char *url;
       char *file = html->string;
 
-      /* Guard against duplicates. */
-      if (string_set_contains (seen, file))
-	continue;
-      string_set_add (seen, file);
-
       /* Determine the URL of the HTML file.  get_urls_html will need
	 it.  */
       url = hash_table_get (dl_file_url_map, file);
@@ -934,8 +946,6 @@
   wtimer_delete (timer);
   logprintf (LOG_VERBOSE, _("Converted %d files in %.2f seconds.\n"),
	     file_count, (double)msecs / 1000);
-
-  string_set_free (seen);
 }
 
 /* Cleanup the data structures associated with recursive retrieving
@@ -955,6 +965,8 @@
       hash_table_destroy (dl_url_file_map);
       dl_url_file_map = NULL;
     }
-  slist_free (downloaded_html_files);
-  downloaded_html_files = NULL;
+  if (downloaded_html_set)
+    string_set_free (downloaded_html_set);
+  slist_free (downloaded_html_list);
+
Re: Wget 1.8.1-pre2 Problem with -i, -r and -l
Hrvoje Niksic [EMAIL PROTECTED] writes:

> For me, this patch fixes the bug in both cases:

And introduces a new one. This patch is required on top of the
previous one. Or simply upgrade to the latest CVS.

2001-12-18  Hrvoje Niksic  [EMAIL PROTECTED]

	* recur.c (retrieve_tree): Make a copy of file obtained from
	dl_url_file_map because the code calls xfree(file) later.

Index: src/recur.c
===================================================================
RCS file: /pack/anoncvs/wget/src/recur.c,v
retrieving revision 1.39
diff -u -r1.39 recur.c
--- src/recur.c	2001/12/18 22:14:31	1.39
+++ src/recur.c	2001/12/18 22:18:50
@@ -228,15 +228,10 @@
 	     the second time.  */
 	  if (dl_url_file_map && hash_table_contains (dl_url_file_map, url))
 	    {
-	      file = hash_table_get (dl_url_file_map, url);
+	      file = xstrdup (hash_table_get (dl_url_file_map, url));
 
 	      DEBUGP (("Already downloaded \"%s\", reusing it from \"%s\".\n",
		       url, file));
-
-	      /* This check might be horribly slow when downloading
-		 sites with a huge number of HTML docs.  Use a hash table
-		 instead!  Thankfully, it gets tripped only when you use
-		 `wget -r URL1 URL2 ...', as explained above.  */
 
 	      if (string_set_contains (downloaded_html_set, file))
		descend = 1;
Re: last wget
On 16 Dec 01, at 19:03, Hrvoje Niksic wrote:

> > Please tell me where to get last version of wget.
>
> As always, the last released version should be available at
> ftp.gnu.org:/pub/gnu/wget/.

Sorry, I should have mentioned that I'm looking for the latest version
of wget for Win9x... Could you help me with that? I don't know how to
recompile it... Thanks for your time. Best regards.

-- 
Yuriy Markiv
mailto:[EMAIL PROTECTED]
http://www.funsms.w.pl
ICQ: 43998304

''During one of my treks through Afghanistan, we lost our corkscrew.
We were compelled to live on food and water for several days.''
	W. C. Fields [American actor, 1880-1946]
Extra newline in output
There's a garbage newline output in http.c. A noticeable effect of
this is that when updating a directory using -N, you get a blank line
for each file that is considered for download.

Index: src/http.c
===================================================================
RCS file: /pack/anoncvs/wget/src/http.c,v
retrieving revision 1.82
diff -u -3 -u -r1.82 http.c
--- src/http.c	2001/12/13 16:46:56	1.82
+++ src/http.c	2001/12/19 04:42:49
@@ -1067,8 +1067,6 @@
       xfree (hdr);
     }
 
-  logputs (LOG_VERBOSE, "\n");
-
   if (contlen != -1 && (http_keep_alive_1 || http_keep_alive_2))
     {

-- 
Alan Eldridge
Just another $THING hacker.