Hi,
I think there has been mail on this issue in the past (especially from
Eddy Thilleman) but it hasn't been adequately addressed IMO. Currently
there is no facility in wget (1.6) to choose which *HTML* links are
followed. All HTML links are followed (subject to recursion and other
rules), and the wildcards in the accept/reject options only affect
which files are actually downloaded.

I recently needed to download image files pointed to by children of
some top-level pages. Unfortunately, each child page also pointed to
all its uncles. Because all the HTML files live in the same
directory, directory-based rules did not work. I had to hack recur.c
to apply the acceptance/rejection rules to HTML links too (at lines
316-344). That way I could say

~/wget-1.6/src/wget -p -r -nH --cut-dirs=2 --wait=10
--accept=jpg,\[0-9\]\*_poster.html
--reject=thumb.jpg,\[a-z\]0_poster.html
www.webshots.com/posters/html/art_abstract0_poster.html

[Explanation: the [a-zA-Z]*_poster.html files are the top-level pages;
they point to the actual poster pages, [0-9]*_poster.html, which in turn
link back to all the top-level pages. I wanted to download only the
subtree rooted at art_abstract0_poster.html.]

I suggest that an option be added (say --checkhtml) to allow
accept/reject rules to be applied to HTML links too.
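
For illustration, here is a rough sketch in Python of the kind of check
such an option might perform before following a link. It is not the
recur.c change itself: the function names (matches, should_follow) and
the sample poster file names are made up, and it assumes the documented
accept/reject semantics, where an entry containing a wildcard character
is matched as a pattern against the file name and any other entry is
treated as a suffix.

```python
from fnmatch import fnmatchcase
from posixpath import basename
from urllib.parse import urlsplit

# Characters that make an accept/reject entry a pattern rather than a suffix.
WILDCARDS = set("*?[]")


def matches(name, entry):
    """Match a file name against one accept/reject list entry."""
    if WILDCARDS & set(entry):
        return fnmatchcase(name, entry)  # pattern: must match the whole name
    return name.endswith(entry)          # plain entry: treated as a suffix


def should_follow(url, accept, reject):
    """Hypothetical helper: decide whether a link (HTML or not) is followed."""
    name = basename(urlsplit(url).path)
    if accept and not any(matches(name, e) for e in accept):
        return False
    return not any(matches(name, e) for e in reject)


# The lists from the command above, with the shell escapes removed.
accept = ["jpg", "[0-9]*_poster.html"]
reject = ["thumb.jpg", "[a-z]0_poster.html"]

# Hypothetical link names, for illustration only.
for link in ("12345_poster.html", "nature0_poster.html", "12345_thumb.jpg"):
    print(link, should_follow("www.webshots.com/posters/html/" + link,
                              accept, reject))
```

With these lists, a numbered poster page is followed, while a link back
to another top-level page is skipped because it matches nothing in the
accept list.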

Thanks, and thanks for wget.

-Ullas
