Scott Scriven wrote:
I'd find it useful to guide wget with regular expressions that
control which links get followed, for example to avoid
following links based on embedded CSS styles or link text.
I've needed this several times, but the most recent was when I
wanted to avoid following any "add to cart" or "buy" links on a
site which uses GET parameters instead of directories to select
content.
Given a link like this...
<a
href="http://www.foo.com/forums/gallery2.php?g2_controller=cart.AddToCart&g2_itemId=11436&g2_return=http%3A%2F%2Fwww.foo.com%2Fforums%2Fgallery2.php%3Fg2_view%3Dcore.ShowItem%26g2_itemId%3D11436%26g2_page%3D4%26g2_GALLERYSID%3D1d78fb5be7613cc31d33f7dfe7fbac7b&g2_GALLERYSID=1d78fb5be7613cc31d33f7dfe7fbac7b&g2_returnName=album"
class="gbAdminLink gbAdminLink gbLink-cart_AddToCart">add to cart</a>
... a useful parameter could be --ignore-regex='AddToCart|add to cart'
so the class or link text (really, anything inside the tag) could
be used to decide whether the link should be followed.
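To make the idea concrete, here is a minimal Python sketch of the proposed behaviour. It is not wget code, and --ignore-regex is only the option name suggested above, not an existing flag. The class LinkFilter and the helper filter_links are hypothetical names made up for this illustration. The pattern is tested against the raw opening tag plus the link text, so either the class attribute or the visible text can cause a link to be skipped.

```python
import re
from html.parser import HTMLParser


class LinkFilter(HTMLParser):
    """Hypothetical helper: sort <a> hrefs into followed/skipped by a regex."""

    def __init__(self, ignore_pattern):
        super().__init__()
        self.ignore = re.compile(ignore_pattern)
        self.followed = []  # hrefs that pass the filter
        self.skipped = []   # hrefs whose tag or text matched the pattern
        self._href = None   # href of the <a> currently open, if any
        self._buf = []      # raw opening tag plus link text seen so far

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            # get_starttag_text() returns the tag exactly as written,
            # so attributes such as class are visible to the regex.
            self._buf = [self.get_starttag_text()]

    def handle_data(self, data):
        if self._href is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = " ".join(self._buf)
            target = self.skipped if self.ignore.search(text) else self.followed
            target.append(self._href)
            self._href = None


def filter_links(html, ignore_pattern):
    """Return (followed, skipped) href lists for the given HTML string."""
    parser = LinkFilter(ignore_pattern)
    parser.feed(html)
    parser.close()
    return parser.followed, parser.skipped
```

Calling filter_links(page, r"AddToCart|add to cart") on a page would put the "add to cart" link shown above into the skipped list, while links whose tag and text do not match stay in the followed list. Matching is case-sensitive in this sketch.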
Or... if there's already a way to do this, let me know. I
didn't see anything in the docs, but I may have missed something.
:)
Regex support is planned for the next release of wget, but I was
wondering whether we should just extend the existing -A and -R options
instead of creating new ones. What do you think?
--
Aequam memento rebus in arduis servare mentem...
Mauro Tortonesi http://www.tortonesi.com
University of Ferrara - Dept. of Eng. http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linux http://www.deepspace6.net
Ferrara Linux User Group http://www.ferrara.linux.it