Re: wget does not parse .netrc properly

2001-12-18 Thread Hrvoje Niksic

Holger Klawitter [EMAIL PROTECTED] writes:

 I am using wget 1.5.3 under Linux (SuSE 7.1) and I discovered that
 wget fails to parse netrc files if some words contain whitespace.

Wget 1.5.3 is old.  I've now tried putting a quoted password in my
`.netrc', and it works for me with the latest version.
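
For reference, a hypothetical `.netrc' entry with whitespace in the
password would be written with the password quoted, along these lines
(hostname and credentials are made up):

  machine ftp.example.com
  login someuser
  password "a pass phrase with spaces"

With a current Wget the quoted value is read as a single token.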



Wget 1.8+CVS not passing referer for recursive retrieval

2001-12-18 Thread Ian Abbott

Although retrieve_tree() stores and retrieves referring URLs in the
URL queue, it does not pass them to retrieve_url(). This seems to
have got lost during the transition from depth-first to breadth-
first retrieval.

This means that HTTP requests for URLs retrieved at a depth greater
than 0 have the Referer header set to the value given by the --referer
option, or not set at all, rather than to the URL of the referring
page.
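
To illustrate (with made-up URLs): after fetching http://somesite/index.html
recursively, the request for a page linked from it should carry the
referring page, roughly

  GET /a.html HTTP/1.0
  Referer: http://somesite/index.html

whereas without the fix the Referer line is either absent or set to the
--referer value for every recursed URL.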


src/ChangeLog entry:

2001-12-18  Ian Abbott  [EMAIL PROTECTED]

* recur.c (retrieve_tree): Pass on referring URL when retrieving
recursed URL.

Index: src/recur.c
===
RCS file: /pack/anoncvs/wget/src/recur.c,v
retrieving revision 1.37
diff -u -r1.37 recur.c
--- src/recur.c 2001/12/13 19:18:31 1.37
+++ src/recur.c 2001/12/18 13:28:58
@@ -237,7 +237,7 @@
  int oldrec = opt.recursive;
 
  opt.recursive = 0;
- status = retrieve_url (url, &file, &redirected, NULL, &dt);
+ status = retrieve_url (url, &file, &redirected, referer, &dt);
  opt.recursive = oldrec;
 
 if (file && status == RETROK





RE: parameters in the URL

2001-12-18 Thread Herold Heiko

More probably it's the usual (sigh) problem of a character the file
system cannot handle ('?' in this case).  I can't test it now, but try
something like

  wget -O out.html
  'http://www.baxleys.org/noah/other_images.php?month_id=11&year_id=2001'

(mind the wrap: that is a single command, and the quotes keep the shell
from acting on the '&') and see whether you can download that URL
directly without going through test2.html.  If the page is saved
correctly in out.html, the problem is indeed the file system problem.
Heiko

-- 
-- PREVINET S.p.A.[EMAIL PROTECTED]
-- Via Ferretto, 1ph  x39-041-5907073
-- I-31021 Mogliano V.to (TV) fax x39-041-5907087
-- ITALY

 -Original Message-
 From: Alan Eldridge [mailto:[EMAIL PROTECTED]]
 Sent: Tuesday, December 18, 2001 3:05 PM
 To: [EMAIL PROTECTED]
 Cc: [EMAIL PROTECTED]
 Subject: Re: parameters in the URL
 
 
 On Tue, Dec 18, 2001 at 05:55:17AM -0800, Nate Baxley wrote:
 okay, it looks like it is finding the page and getting
 a 200 back from HTTP, but then can't find the page. 
 
 I don't think that's what it's saying at all. Read on...
 
 --22:41:11--
 http://www.baxleys.org/noah/other_images.php?month_id=11&year_id=2001
   => `www.baxleys.org/noah/other_images.php?month_id=11&year_id=2001'
 Connecting to www.baxleys.org[66.78.8.167]:80...
 connected.
 HTTP request sent, awaiting response... 200 OK
 Length: unspecified [text/html]
 www.baxleys.org/noah/other_images.php?month_id=11&year_id=2001:
 No such file or directory
 
 Cannot write to
 `www.baxleys.org/noah/other_images.php?month_id=11&year_id=2001'
 (No such file or directory).
 
 That's a local error.  The (f)open on the output file failed with
 errno set to ENOENT.
 
 This could still be a wget problem, if somehow it got confused and
 managed to miss out a mkdir call.
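 
 As a minimal standalone illustration of that failure mode (not wget
 code, names made up): trying to create a file under a directory that
 does not exist fails with exactly this errno.
 
   #include <errno.h>
   #include <stdio.h>
   #include <string.h>
 
   int main (void)
   {
     /* "missing-dir" is assumed not to exist, so the parent directory
        of the output file is absent and fopen cannot create it.  */
     FILE *fp = fopen ("missing-dir/out.html", "w");
     if (fp == NULL)
       {
         /* Prints "fopen: No such file or directory" (ENOENT).  */
         printf ("fopen: %s\n", strerror (errno));
       }
     else
       fclose (fp);
     return 0;
   }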
 
 Guess next step is to run it with debugging compiled in and turned on.
 
 -- 
 Alan Eldridge
 Just another $THING hacker.
 



Re: Wget 1.8+CVS not passing referer for recursive retrieval

2001-12-18 Thread Hrvoje Niksic

Ian Abbott [EMAIL PROTECTED] writes:

 Although retrieve_tree() stores and retrieves referring URLs in the
 URL queue, it does not pass them to retrieve_url(). This seems to
 have got lost during the transition from depth-first to breadth-
 first retrieval.

It was an oversight on my part.  Pretty funny, too -- I did all the
work to retain the referrers, except actually *using* them.

Thanks for the patch; I've now applied it to CVS.



Wget 1.8.1-pre2 Problem with -i, -r and -l

2001-12-18 Thread Ian Abbott

I don't have time to look at this problem today, but I thought I'd
mention it now to defer the 1.8.1 release.

If I have a website http://somesite/ with three files on it:
index.html, a.html and b.html, such that index.html links only to
a.html and a.html links only to b.html then the following command
will retrieve all three files:

  wget -r -l 1 http://somesite/index.html http://somesite/a.html

However, if I then create a file 'list' containing the lines:

  http://somesite/index.html
  http://somesite/a.html

and issue the command:

  wget -r -l 1 -i list

then only index.html and a.html are retrieved. I think wget should
also retrieve b.html, which is linked to by a.html, i.e. treat the
URLs in the file as though they had been specified on the command line.




Re: Wget 1.8.1-pre2 Problem with -i, -r and -l

2001-12-18 Thread Hrvoje Niksic

Ian Abbott [EMAIL PROTECTED] writes:

 If I have a website http://somesite/ with three files on it:
 index.html, a.html and b.html, such that index.html links only to
 a.html and a.html links only to b.html then the following command
 will retrieve all three files:
 
   wget -r -l 1 http://somesite/index.html http://somesite/a.html

Does it?  For me this command retrieves only `index.html' and
`a.html', and that's a bug.  `-i list' makes no difference.

For me, this patch fixes the bug in both cases:

2001-12-18  Hrvoje Niksic  [EMAIL PROTECTED]

* recur.c (register_html): Maintain a hash table of HTML files
along with the list.  Disallow duplicates.
(retrieve_tree): Use downloaded_html_set to check whether the file
found in dl_url_file_map is an HTML file, and descend into it if
so.
(convert_all_links): Don't guard against duplicates in
downloaded_html_list, since they are no longer possible.

Index: src/recur.c
===
RCS file: /pack/anoncvs/wget/src/recur.c,v
retrieving revision 1.38
diff -u -r1.38 recur.c
--- src/recur.c 2001/12/18 15:22:03 1.38
+++ src/recur.c 2001/12/18 22:10:56
@@ -53,11 +53,12 @@
 static struct hash_table *dl_file_url_map;
 static struct hash_table *dl_url_file_map;
 
-/* List of HTML files downloaded in this Wget run.  Used for link
-   conversion after Wget is done.  This list should only be traversed
-   in order.  If you need to check whether a file has been downloaded,
-   use a hash table, e.g. dl_file_url_map.  */
-static slist *downloaded_html_files;
+/* List of HTML files downloaded in this Wget run, used for link
+   conversion after Wget is done.  The list and the set contain the
+   same information, except the list maintains the order.  Perhaps I
+   should get rid of the list, it's there for historical reasons.  */
+static slist *downloaded_html_list;
+static struct hash_table *downloaded_html_set;
 
 static void register_delete_file PARAMS ((const char *));
 
@@ -227,8 +228,18 @@
 the second time.  */
   if (dl_url_file_map && hash_table_contains (dl_url_file_map, url))
{
+ file = hash_table_get (dl_url_file_map, url);
+
  DEBUGP (("Already downloaded \"%s\", reusing it from \"%s\".\n",
-  url, (char *)hash_table_get (dl_url_file_map, url)));
+  url, file));
+
+ /*  This check might be horribly slow when downloading
+    sites with a huge number of HTML docs.  Use a hash table
+    instead!  Thankfully, it gets tripped only when you use
+    `wget -r URL1 URL2 ...', as explained above.  */
+
+ if (string_set_contains (downloaded_html_set, file))
+   descend = 1;
}
   else
{
@@ -815,9 +826,16 @@
 void
 register_html (const char *url, const char *file)
 {
-  if (!opt.convert_links)
+  if (!downloaded_html_set)
+    downloaded_html_set = make_string_hash_table (0);
+  else if (hash_table_contains (downloaded_html_set, file))
 return;
-  downloaded_html_files = slist_prepend (downloaded_html_files, file);
+
+  /* The set and the list should use the same copy of FILE, but the
+     slist interface insists on strduping the string it gets.  Oh
+     well. */
+  string_set_add (downloaded_html_set, file);
+  downloaded_html_list = slist_prepend (downloaded_html_list, file);
 }
 
 /* This function is called when the retrieval is done to convert the
@@ -843,23 +861,17 @@
   int file_count = 0;
 
   struct wget_timer *timer = wtimer_new ();
-  struct hash_table *seen = make_string_hash_table (0);
 
   /* Destructively reverse downloaded_html_files to get it in the right order.
  recursive_retrieve() used slist_prepend() consistently.  */
-  downloaded_html_files = slist_nreverse (downloaded_html_files);
+  downloaded_html_list = slist_nreverse (downloaded_html_list);
 
-  for (html = downloaded_html_files; html; html = html->next)
+  for (html = downloaded_html_list; html; html = html->next)
 {
   struct urlpos *urls, *cur_url;
   char *url;
   char *file = html->string;
 
-  /* Guard against duplicates. */
-  if (string_set_contains (seen, file))
-   continue;
-  string_set_add (seen, file);
-
   /* Determine the URL of the HTML file.  get_urls_html will need
 it.  */
   url = hash_table_get (dl_file_url_map, file);
@@ -934,8 +946,6 @@
   wtimer_delete (timer);
   logprintf (LOG_VERBOSE, _("Converted %d files in %.2f seconds.\n"),
 file_count, (double)msecs / 1000);
-
-  string_set_free (seen);
 }
 
 /* Cleanup the data structures associated with recursive retrieving
@@ -955,6 +965,8 @@
   hash_table_destroy (dl_url_file_map);
   dl_url_file_map = NULL;
 }
-  slist_free (downloaded_html_files);
-  downloaded_html_files = NULL;
+  if (downloaded_html_set)
+    string_set_free (downloaded_html_set);
+  slist_free (downloaded_html_list);
+  

Re: Wget 1.8.1-pre2 Problem with -i, -r and -l

2001-12-18 Thread Hrvoje Niksic

Hrvoje Niksic [EMAIL PROTECTED] writes:

 For me, this patch fixes the bug in both cases:

And introduces a new one.  This patch is required on top of the
previous one.  Or simply upgrade to the latest CVS.

2001-12-18  Hrvoje Niksic  [EMAIL PROTECTED]

* recur.c (retrieve_tree): Make a copy of file obtained from
dl_url_file_map because the code calls xfree(file) later.
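
The reasoning, as a minimal standalone sketch (not wget code): a string
looked up in a hash table is owned by the table, so freeing it directly
leaves the table holding a dangling pointer; freeing a copy is safe.

  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>

  int main (void)
  {
    char *stored = strdup ("index.html");  /* imagine this lives in the map */

    char *file = strdup (stored);          /* the xstrdup added by the patch */
    printf ("working with %s\n", file);
    free (file);                           /* the later xfree is now harmless */

    printf ("the map still holds %s\n", stored);
    free (stored);
    return 0;
  }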

Index: src/recur.c
===
RCS file: /pack/anoncvs/wget/src/recur.c,v
retrieving revision 1.39
diff -u -r1.39 recur.c
--- src/recur.c 2001/12/18 22:14:31 1.39
+++ src/recur.c 2001/12/18 22:18:50
@@ -228,15 +228,10 @@
 the second time.  */
   if (dl_url_file_map && hash_table_contains (dl_url_file_map, url))
{
- file = hash_table_get (dl_url_file_map, url);
+ file = xstrdup (hash_table_get (dl_url_file_map, url));
 
  DEBUGP (("Already downloaded \"%s\", reusing it from \"%s\".\n",
   url, file));
-
- /*  This check might be horribly slow when downloading
-    sites with a huge number of HTML docs.  Use a hash table
-    instead!  Thankfully, it gets tripped only when you use
-    `wget -r URL1 URL2 ...', as explained above.  */
 
  if (string_set_contains (downloaded_html_set, file))
descend = 1;



Re: last wget

2001-12-18 Thread Yuriy Markiv

On 16 Dec 01, at 19:03, Hrvoje Niksic wrote:

  Please tell me where to get last version of wget.
 
 As always, the last released version should be available at
 ftp.gnu.org:/pub/gnu/wget/.

Sorry, I should have mentioned that I'm looking for the latest version
of wget for Win9x...
Could you help me with that?
I don't know how to compile it myself...

Thanks for your time.
Best regards.

-- 
Yuriy Markiv 
mailto:[EMAIL PROTECTED] http://www.funsms.w.pl ICQ:43998304
PGP: mailto:[EMAIL PROTECTED]?[EMAIL PROTECTED]%0D%0Aexit

 "During one of my treks through Afghanistan, we lost our
 corkscrew. We were compelled to live on food and water
 for several days."
W. C. Fields [American actor, 1880-1946]

___
 Fight Spam! Join EuroCAUCE: http://www.euro.cauce.org/ 




Extra newline in output

2001-12-18 Thread Alan Eldridge

There's a spurious newline output in http.c.  A noticeable effect of
this is that when updating a directory using -N, you get a blank line
for each file that is considered for download.

Index: src/http.c
===
RCS file: /pack/anoncvs/wget/src/http.c,v
retrieving revision 1.82
diff -u -3 -u -r1.82 http.c
--- src/http.c  2001/12/13 16:46:56 1.82
+++ src/http.c  2001/12/19 04:42:49
@@ -1067,8 +1067,6 @@
   xfree (hdr);
 }
 
-  logputs (LOG_VERBOSE, "\n");
-
   if (contlen != -1
       && (http_keep_alive_1 || http_keep_alive_2))
 {
   
-- 
Alan Eldridge
Just another $THING hacker.