Hi.
I have found several bugs in wget 1.7.1-pre1 when downloading eBay
auction web pages.
Background:
In order to reduce the time spent checking out eBay auctions, I have
set up some shell scripts to prefetch the auction pages using "wget -p"
in the background while I have my browser (netscape-4.77 under Linux)
display the web pages from my hard disk.
eBay auction pages change constantly, so I can't give you a
specific URL to reproduce the problems. You should be able to find
examples pretty easily by typing in random "item" numbers. For
example, this auction will end in 4 days:
http://cgi.ebay.com/aw-cgi/eBayISAPI.dll?ViewItem&item=1639000000
If it has expired, just increment the number at the end by a
couple hundred thousand.
The first bug I ran into is that eBay does not generate a
"Content-Type" header; it uses the "<META HTTP-EQUIV="Content-Type""
tag instead. Why? I dunno. I think they may be playing around with
XHTML or XML or something.
Anyway, to fix this bug, I changed http.c to assume a content type of
text/html if no such header is found. Here is the patch:
--- wget-1.7.1-pre1/src/http.c Sun Sep 16 15:19:52 2001
+++ wget-1.7.1-pre1.wrs/src/http.c Sat Sep 8 20:20:29 2001
@@ -1059,6 +1059,15 @@
xfree (hdr);
}
+  /* eBay doesn't send the Content-Type: header.  I guess most browsers
+   * must assume that stuff is HTML unless told otherwise, so we
+   * should do the same.
+   */
+
+  if (!type)
+    type = xstrdup (TEXTHTML_S);
+
+
logputs (LOG_VERBOSE, "\n");
if (contlen != -1
Next, I had quite a few problems with wget double-converting links
that it had downloaded. I can understand why wget needs to make a
second pass: as explained in the code, you don't know which URLs
are going to be downloaded until you have finished. However, I
couldn't figure out why wget needs to do the first conversion pass
at all. Why not just convert everything at the end? So, I tried
commenting out the first pass, and found that things worked very
well. Hmmm... I suspect I'm missing something, but applying the
following patch solved my problems:
--- wget-1.7.1-pre1/src/recur.c Thu Jun 14 16:48:00 2001
+++ wget-1.7.1-pre1.wrs/src/recur.c Mon Sep 10 10:03:03 2001
@@ -510,12 +510,14 @@
freeurl (u, 1);
/* Increment the pbuf for the appropriate size. */
}
+#if 0
if (opt.convert_links && !opt.delete_after)
/* This is merely the first pass: the links that have been
successfully downloaded are converted. In the second pass,
convert_all_links() will also convert those links that have NOT
been downloaded to their canonical form. */
convert_links (file, url_list);
+#endif
/* Free the linked list of URL-s. */
free_urlpos (url_list);
/* Free the canonical this_url. */
I also had problems with wget converting internal links (<a href="#DESC">),
so I made the following patch to html-url.c:
--- wget-1.7.1-pre1/src/html-url.c Sun May 27 14:35:02 2001
+++ wget-1.7.1-pre1.wrs/src/html-url.c Mon Sep 10 10:28:33 2001
@@ -320,6 +320,11 @@
memcpy (p, link_uri, hashlen);
p[hashlen] = '\0';
link_uri = p;
+
+  /* If we are just linking within the same HTML file, we don't want
+   * to do anything.  */
+  if (*link_uri == '\0')
+    return;
}
if (!base)
Finally, I found that wget was quoting the URLs that it converts, but
not quoting the file names that it downloads. So, Netscape (4.77 on
Linux) couldn't find many of the links that wget fetched. Again, by
disabling code in wget, I made things work better for me. Here is the
patch to url.c:
--- wget-1.7.1-pre1/src/url.c Sun May 27 14:35:10 2001
+++ wget-1.7.1-pre1.wrs/src/url.c Mon Sep 10 13:00:03 2001
@@ -1403,22 +1403,34 @@
{
/* Convert absolute URL to relative. */
char *newname = construct_relative (file, l->local_name);
+#ifdef QUOTE_URL
char *quoted_newname = html_quote_string (newname);
replace_attr (&p, l->size, fp, quoted_newname);
+#else
+ replace_attr (&p, l->size, fp, newname);
+#endif
DEBUGP (("TO_RELATIVE: %s to %s at position %d in %s.\n",
l->url, newname, l->pos, file));
xfree (newname);
+#ifdef QUOTE_URL
xfree (quoted_newname);
+#endif
}
else if (l->convert == CO_CONVERT_TO_COMPLETE)
{
/* Convert the link to absolute URL. */
char *newlink = l->url;
+#ifdef QUOTE_URL
char *quoted_newlink = html_quote_string (newlink);
replace_attr (&p, l->size, fp, quoted_newlink);
+#else
+ replace_attr (&p, l->size, fp, newlink);
+#endif
DEBUGP (("TO_COMPLETE: <something> to %s at position %d in %s.\n",
newlink, l->pos, file));
+#ifdef QUOTE_URL
xfree (quoted_newlink);
+#endif
}
}
/* Output the rest of the file. */
Now, there are still some problems with file names and URL quoting, so I
suspect this patch isn't right. In particular, if, say, an image on
the auction comes from "http://foo.com/%7Euser/ebay/image.jpg", wget
will download the file into the %7Euser directory, but when Netscape
goes looking for it, it will look for the ~user directory. There are
similar problems if the URL has embedded spaces, ampersands, or
percent signs.
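To make that mismatch concrete, here is a rough sketch of the
percent-decoding a browser effectively applies to a file: URL before it
touches the filesystem. This is just my own illustration, not anything
from wget's source; url_unescape is a made-up name:

#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Decode %XX escapes in a URL path, the way a browser does before it
 * looks the path up on disk.  "%7Euser" decodes to "~user". */
static char *
url_unescape (const char *s)
{
  char *out = malloc (strlen (s) + 1);
  char *p = out;

  if (!out)
    return NULL;

  while (*s)
    {
      if (*s == '%' && isxdigit ((unsigned char) s[1])
          && isxdigit ((unsigned char) s[2]))
        {
          char hex[3] = { s[1], s[2], '\0' };
          *p++ = (char) strtol (hex, NULL, 16);
          s += 3;
        }
      else
        *p++ = *s++;
    }
  *p = '\0';
  return out;
}

int
main (void)
{
  char *path = url_unescape ("%7Euser/ebay/image.jpg");
  printf ("%s\n", path);   /* prints "~user/ebay/image.jpg" */
  free (path);
  return 0;
}

The page on disk still refers to "%7Euser", but the browser decodes
that to "~user" before it ever hits the filesystem, so it never finds
the directory that wget actually created.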
Are these known bugs?
Are there better solutions than what I've done?
-wayne