wget has a -p option:
-p, --page-requisites get all images, etc. needed to display HTML page.
this implies that it'll go and fetch any jpgs, pngs, css
stylesheets etc which are on the page. really useful.
there is another option -np, which is _really_ useful with
mirroring sites:
-np, --no-parent don't ascend to the parent directory.
which does exactly as it says.
using -p and -np together on a URL with one <img src> tag does this:
23:02 bozar:~/tmp% wget -nv -p -np http://www.heebie.net/royo/royo21.htm
23:03:57 URL:http://www.heebie.net/royo/royo21.htm [4571/4571] ->
"www.heebie.net/royo/royo21.htm" [1]
there is one picture referenced by a <img
src="/newimg/something.jpg"> tag in that file, but it is not
fetched. that behaviour with -p -np is correct: the image is in
/newimg and the html file is in /royo, so -np blocks it. so -np works.
i'm proposing that -np should _not_ take effect when you are
dealing with a page requisite (anything other than a .htm or a
.html).
this is really useful, so you can use -np to restrict your
mirroring hierarchy to only that directory or below, but you can
still grab images and other requisites from outside of that
directory. this means that if you want to mirror
http://site/foo/bar.html, and bar.html uses images from the
directory http://site/images/, you can just use -p -np and it'll
work fine. the -np being there still means that your mirror
won't avalanche to directories all over the site.
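a minimal sketch of the rule described above, written as standalone C rather than as wget code. the names path_suffix, is_page_requisite and allowed_under_no_parent are hypothetical and do not exist in wget; the sketch also assumes the start directory ends in '/' and that a URL with no suffix is _not_ a requisite, so -np still applies to it.

```c
#include <string.h>

/* Hypothetical helper: return the text after the last '.' in the final
   path component, or NULL if that component contains no '.'.  */
static const char *
path_suffix (const char *path)
{
  const char *slash = strrchr (path, '/');
  const char *name = slash ? slash + 1 : path;
  const char *dot = strrchr (name, '.');
  return dot ? dot + 1 : NULL;
}

/* Anything whose suffix is not "htm" or "html" counts as a requisite.
   Assumption of this sketch: no suffix at all means "not a requisite".
   The comparison is case-sensitive.  */
static int
is_page_requisite (const char *path)
{
  const char *suf = path_suffix (path);
  if (suf == NULL)
    return 0;
  return strcmp (suf, "html") != 0 && strcmp (suf, "htm") != 0;
}

/* Return nonzero if url_path may be fetched while -np is in force.
   base_dir is the directory of the starting URL, e.g. "/royo/".
   page_requisites is nonzero when -p was given.  */
static int
allowed_under_no_parent (const char *base_dir, const char *url_path,
                         int page_requisites)
{
  /* Inside the starting directory or below: always fine.  */
  if (strncmp (url_path, base_dir, strlen (base_dir)) == 0)
    return 1;
  /* Outside it: only requisites escape, and only with -p.  */
  return page_requisites && is_page_requisite (url_path);
}
```

with base_dir "/royo/", this lets "/newimg/lr1-11.jpg" through when -p is set, but still rejects an html page that lives outside /royo/.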
you could always do something like "wget -I/images" to have the
above work, but when you are mirroring with -m, images can come
from _many_ places. massive art galleries often have images in
different directories (/images1, /images2, etc) ... and it's so
convenient to do "wget -p -np" instead of "wget
-I/images1,/images2,/images3,...,/imagesN".
i've submitted a patch, so that the new behaviour with -p -np is
now like this:
9:56 bozar:~/tmp% wget -nv -p -np http://www.heebie.net/royo/royo21.htm
09:56:24 URL:http://www.heebie.net/royo/royo21.htm [4571/4571] ->
"www.heebie.net/royo/royo21.htm" [1]
09:56:24 URL:http://www.heebie.net/newimg/lr1-11.jpg [116352/116352] ->
"www.heebie.net/newimg/lr1-11.jpg" [1]
(the very small) patch is against wget 1.6. it's a dodgy hack
more than anything else, so i don't expect (or suggest) that
this patch should go in as it is ... i think another option or
modifier is needed so that we maintain backward compatibility
with how -p -np works with 1.6.
perhaps -pf (page-requisites + force), or -npr
(--no-parent-except-requisites)?
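one way a separate option could keep the 1.6 behaviour of -p -np intact is sketched below. this is only an illustration: struct np_options, its field names and reject_outside_url are hypothetical, not part of wget, and the sketch assumes that the new option on its own implies the -np restriction for html pages.

```c
/* Hypothetical option set; the field names are illustrative only.  */
struct np_options
{
  int no_parent;                   /* -np */
  int page_requisites;             /* -p */
  int no_parent_except_requisites; /* the proposed -npr */
};

/* Decide what to do with a URL that lies outside the starting
   directory.  Return nonzero to reject it, zero to fetch it.  */
static int
reject_outside_url (const struct np_options *o, int is_requisite)
{
  /* No restriction requested at all: fetch everything.  */
  if (!o->no_parent && !o->no_parent_except_requisites)
    return 0;
  /* New behaviour, only when explicitly asked for with -npr.  */
  if (o->no_parent_except_requisites && o->page_requisites && is_requisite)
    return 0;
  /* Plain -np, with or without -p, behaves as in 1.6.  */
  return 1;
}
```

under this sketch, "-p -np" still rejects a requisite outside the directory, while "-p -npr" fetches it and keeps rejecting outside html pages.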
if you'd like me to code it up, please let me know ... i believe
this feature is _really_ useful.
thanks!
ps: the wget source is some of the best C i've ever read.
--
#ozone/algorithm <[EMAIL PROTECTED]> - trust.in.love.to.save
--- wget-1.6-orig/src/recur.c Mon Dec 18 06:28:20 2000
+++ wget-1.6/src/recur.c Tue Apr 10 09:54:31 2001
@@ -283,9 +283,13 @@
if (!(base_dir && frontcmp (base_dir, u->dir)))
{
/* Failing that, check for parent dir. */
+ char *suf = NULL;
struct urlinfo *ut = newurl ();
+ suf = suffix (constr);
if (parseurl (this_url, ut, 0) != URLOK)
DEBUGP (("Double yuck! The *base* URL is broken.\n"));
+ else if (opt.page_requisites && strcmp (suf, "html") && strcmp (suf, "htm"))
+ DEBUGP (("Escaping no_parent jail since this is a page requisite.\n"));
else if (!frontcmp (ut->dir, u->dir))
{
/* Failing that too, kill the URL. */