Scott Scriven wrote:
I'd find it useful to guide wget by using regular expressions to
control which links get followed.  For example, to avoid
following links based on embedded css styles or link text.

I've needed this several times, but the most recent was when I
wanted to avoid following any "add to cart" or "buy" links on a
site which uses GET parameters instead of directories to select
content.

Given a link like this...

<a 
href="http://www.foo.com/forums/gallery2.php?g2_controller=cart.AddToCart&amp;g2_itemId=11436&amp;g2_return=http%3A%2F%2Fwww.foo.com%2Fforums%2Fgallery2.php%3Fg2_view%3Dcore.ShowItem%26g2_itemId%3D11436%26g2_page%3D4%26g2_GALLERYSID%3D1d78fb5be7613cc31d33f7dfe7fbac7b&amp;g2_GALLERYSID=1d78fb5be7613cc31d33f7dfe7fbac7b&amp;g2_returnName=album";
 class="gbAdminLink gbAdminLink gbLink-cart_AddToCart">add to cart</a>

... a useful parameter could be --ignore-regex='AddToCart|add to cart'
so the class or link text (really, anything inside the tag) could
be used to decide whether the link should be followed.
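
For illustration, here is a minimal Python sketch of the kind of
filtering such an option might perform. It is not wget code:
--ignore-regex is only the option proposed above, and the names
LinkCollector and links_to_follow are hypothetical. The sketch
assumes the pattern is tested against the raw opening tag plus the
link text, so the class attribute or the words "add to cart" are
enough to reject a link.

```python
import re
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, raw_text) pairs for every <a> element (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.links = []     # finished (href, raw_text) pairs
        self._href = None   # href of the <a> currently open, if any
        self._raw = []      # raw start tag plus link text seen so far

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            # get_starttag_text() returns the start tag exactly as written,
            # so attributes such as class="..." remain visible to the regex.
            self._raw = [self.get_starttag_text()]

    def handle_data(self, data):
        # Only record text that appears inside an open <a href=...> element.
        if self._href is not None:
            self._raw.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._raw)))
            self._href = None


def links_to_follow(html, ignore_regex):
    """Return the hrefs whose tag and link text do NOT match ignore_regex."""
    pattern = re.compile(ignore_regex)
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return [href for href, raw in collector.links if not pattern.search(raw)]


# Shortened stand-ins for the links on the gallery page described above.
page = (
    '<a href="gallery2.php?g2_controller=cart.AddToCart&amp;g2_itemId=11436" '
    'class="gbLink-cart_AddToCart">add to cart</a>'
    '<a href="gallery2.php?g2_view=core.ShowItem&amp;g2_itemId=11436">view item</a>'
)

print(links_to_follow(page, r"AddToCart|add to cart"))
```

With the pattern from the example, only the "view item" link is
kept. A pattern that matches neither link leaves both in the list.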

Or...  if there's already a way to do this, let me know.  I
didn't see anything in the docs, but I may have missed something.

:)

Regex support is planned for the next release of wget, but I was wondering whether we should just extend the existing -A and -R options instead of creating new ones. What do you think?

--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi                          http://www.tortonesi.com

University of Ferrara - Dept. of Eng.    http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linux            http://www.deepspace6.net
Ferrara Linux User Group                 http://www.ferrara.linux.it
