* Mauro Tortonesi <[EMAIL PROTECTED]> wrote:

> wget -r --filter=-domain:www-*.yoyodyne.com
This appears to match "www.yoyodyne.com", "www--.yoyodyne.com",
"www-------.yoyodyne.com", and so on, if interpreted as a regex. Since the
unescaped dots match any character, it would most likely also match
"www---zyoyodyneXcom". Perhaps you want glob patterns instead? I know I
wouldn't mind having glob patterns in addition to regexes... globs are
much easier when you're not doing complex matches.

If I had to choose just one, though, I'd prefer PCRE (Perl-Compatible
Regular Expressions). They offer a richer, more concise syntax than
traditional regexes, such as \d instead of [:digit:] or [0-9].

> --filter=[+|-][file|path|domain]:REGEXP
>
> is it consistent? is it flawed? is there a more convenient one?

It seems like a good idea, but it wouldn't actually provide the
regex-filtering features I'm hoping for unless there were a "raw" type in
addition to "file", "domain", etc. I'll give details below. Basically, I
need to match based on things like the inline CSS data, the visible link
text, etc.

> please notice that supporting multiple comma-separated regexp in a
> single --filter option:
>
> --filter=[+|-][file|path|domain]:REGEXP1,REGEXP2,...

Commas for multiple regexes are unnecessary. Regexes already have an "or"
operator built in: to match "fee" or "fie" or "foe" or "fum", the pattern
is fee|fie|foe|fum.

> we also have to reach consensus on the filtering algorithm. for
> instance, should we simply require that a url passes all the
> filtering rules to allow its download (just like the current
> -A/R behaviour), or should we instead adopt a short circuit
> algorithm that applies all rules in the same order in which
> they were given in the command line and immediately allows the
> download of an url if it passes the first "allow" match?

Regexes implicitly have "or" functionality built in, via the pipe
operator. They also have "and" built in, simply by extending the pattern:
to require both "foo" and "bar" in a match, you could write something
like "foo.*bar|bar.*foo".
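To make the "or"/"and" point concrete, here's a minimal sketch in Python's re module (a stand-in for PCRE; the URLs are made up for illustration):

```python
import re

# Hypothetical URLs a crawler might encounter.
urls = [
    "http://example.com/foo/page.html",
    "http://example.com/bar/page.html",
    "http://example.com/foo/bar/page.html",
]

# "or": the pipe operator matches URLs containing either "foo" or "bar".
either = re.compile(r"foo|bar")
print([u for u in urls if either.search(u)])   # all three match

# "and": require both "foo" and "bar", in either order, in one pattern.
both = re.compile(r"foo.*bar|bar.*foo")
print([u for u in urls if both.search(u)])     # only the last one
```

The same patterns would work unchanged in PCRE, since alternation and `.*` behave identically there.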
So, it's not strictly necessary to support more than one regex unless you
want to specify both an include pattern and an exclude pattern. However,
if multiple patterns are supported, I think it would be more helpful to
implement them as "and" rather than "or". This is just because spelling
out "and" inside a single pattern doubles its length, so it may be more
convenient to say "--filter=foo --filter=bar" than
"--filter='foo.*bar|bar.*foo'".

Below is the original message I sent to the wget list a few months ago,
about this same topic:

=====

I'd find it useful to guide wget by using regular expressions to control
which links get followed. For example, to avoid following links based on
embedded CSS styles or link text. I've needed this several times, but the
most recent was when I wanted to avoid following any "add to cart" or
"buy" links on a site which uses GET parameters instead of directories to
select content.

Given a link like this...

<a href="http://www.foo.com/forums/gallery2.php?g2_controller=cart.AddToCart&g2_itemId=11436&g2_return=http%3A%2F%2Fwww.foo.com%2Fforums%2Fgallery2.php%3Fg2_view%3Dcore.ShowItem%26g2_itemId%3D11436%26g2_page%3D4%26g2_GALLERYSID%3D1d78fb5be7613cc31d33f7dfe7fbac7b&g2_GALLERYSID=1d78fb5be7613cc31d33f7dfe7fbac7b&g2_returnName=album" class="gbAdminLink gbAdminLink gbLink-cart_AddToCart">add to cart</a>

... a useful parameter could be:

--ignore-regex='AddToCart|add to cart'

so the class or link text (really, anything inside the tag) could be used
to decide whether the link should be followed.

Or... if there's already a way to do this, let me know. I didn't see
anything in the docs, but I may have missed something. :)

=====

I think what I want could be implemented via the --filter option, with a
few small modifications to what was proposed. I'm not sure exactly what
syntax to use, but it should be able to specify whether to
include/exclude the link, which PCRE flags to use, how much of the raw
HTML tag to use as input, and what pattern to use for matching.
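To show what matching against the raw tag would look like, here's a small sketch in Python's re module (again standing in for PCRE; the tag is a shortened version of the example above, and none of this is actual wget code):

```python
import re

# Shortened version of the "add to cart" link from the example above.
tag = ('<a href="gallery2.php?g2_controller=cart.AddToCart&g2_itemId=11436" '
       'class="gbLink-cart_AddToCart">add to cart</a>')

# A different, harmless link for comparison (made up for illustration).
plain = '<a href="gallery2.php?g2_view=core.ShowItem">view item</a>'

# The pattern from --ignore-regex, applied case-insensitively.
ignore = re.compile(r"AddToCart|add ?to ?cart", re.IGNORECASE)

# Because the entire tag (href, class, and link text) is the input,
# any of those parts can trigger the exclusion.
print(bool(ignore.search(tag)))     # True  -> link would be skipped
print(bool(ignore.search(plain)))   # False -> link would be followed
```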
Here's an idea:

--filter=[allow][flags,][scope][:]pattern

Example: '--filter=-i,raw:add ?to ?cart' (the quotes are there only to
make the shell treat it as one parameter)

The details are:

"allow" is "+" for "include" or "-" for "exclude". It defaults to "+" if
omitted.

"flags," is a set of letters to control regex options, followed by a
comma (to separate it from the scope). For example, "i" specifies a
case-insensitive search. These would be the same flags that perl appends
to the end of search patterns; so, instead of "/foo/i", it would be
"--filter=+i,:foo".

"scope" controls how much of the <a> or similar tag gets used as input to
the regex. Values include:

  raw:    use the entire tag and all contents (default)
          <a href="/path/to/foo.ext">bar</a>
  domain: use only the domain name
          www.example.com
  file:   use only the file name
          foo.ext
  path:   use the directory, but not the file name
          /path/to
  others can be added as desired

":" is required if "allow" or "flags" or "scope" is given.

So, for example, to exclude the "add to cart" links in my previous post,
any of these could be used:

--filter=-raw:'AddToCart|add to cart'
--filter=-raw:AddToCart\|add\ to\ cart
--filter=-:'AddToCart|add to cart'
--filter=-i,raw:'add ?to ?cart'

Alternately, the --filter option could be split into two options: one for
including content, and one for excluding. This would be more consistent
with wget's existing parameters, and would slightly simplify the syntax.

I hope I haven't been too full of hot air. This is a feature I've wanted
in wget for a long time, and I'm a bit excited that it might happen soon.
:)

--
Scott
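To check that the proposed syntax is unambiguous, here's a rough parsing sketch in Python (purely illustrative, not wget code; the function name and the tuple it returns are my own invention, and the scope names and defaults follow the description above):

```python
import re

# [allow][flags,][scope][:]pattern -- every prefix part is optional,
# but ":" is required whenever any prefix part is present.
FILTER_RE = re.compile(
    r"^(?P<allow>[+-])?"                      # "+" (default) or "-"
    r"(?:(?P<flags>[a-z]+),)?"                # flags end at the comma
    r"(?P<scope>raw|domain|file|path)?"       # scope keyword
    r":(?P<pattern>.*)$"                      # ":" then the regex itself
)

def parse_filter(arg):
    """Return (allow, flags, scope, pattern) for a --filter argument."""
    m = FILTER_RE.match(arg)
    if m is None:
        # No prefix at all: the whole argument is the pattern.
        return ("+", "", "raw", arg)
    return (
        m.group("allow") or "+",
        m.group("flags") or "",
        m.group("scope") or "raw",
        m.group("pattern"),
    )

print(parse_filter("-i,raw:add ?to ?cart"))
# ('-', 'i', 'raw', 'add ?to ?cart')
print(parse_filter("+i,:foo"))
# ('+', 'i', 'raw', 'foo')
```

One nice property of requiring the ":" with any prefix is that a bare pattern like "foo" still parses cleanly as an include-everything-matching filter with the default "raw" scope.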