Re: regex support RFC

Scott Scriven Thu, 30 Mar 2006 10:35:34 -0800

* Mauro Tortonesi <[EMAIL PROTECTED]> wrote:
> wget -r --filter=-domain:www-*.yoyodyne.com


This appears to match "www.yoyodyne.com", "www--.yoyodyne.com",
"www-------.yoyodyne.com", and so on, if interpreted as a regex.
It would most likely also match "www---zyoyodyneXcom".  Perhaps
you want glob patterns instead?  I know I wouldn't mind having
glob patterns in addition to regexes...  glob is much easier
when you're not doing complex matches.
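
A minimal sketch of the difference, using Python's re and fnmatch
modules as stand-ins for whatever matcher wget would use.  The host
"www-eu.yoyodyne.com" is made up for illustration; the other three
are the ones mentioned above.

```python
import fnmatch
import re

PATTERN = "www-*.yoyodyne.com"

hosts = [
    "www.yoyodyne.com",
    "www--.yoyodyne.com",
    "www---zyoyodyneXcom",
    "www-eu.yoyodyne.com",  # hypothetical host, for illustration only
]

def regex_hit(host):
    # As a regex: "-*" is zero or more hyphens, "." is any character.
    return re.fullmatch(PATTERN, host) is not None

def glob_hit(host):
    # As a glob: "*" is any run of characters, "." is a literal dot.
    return fnmatch.fnmatchcase(host, PATTERN)

results = {h: (regex_hit(h), glob_hit(h)) for h in hosts}
for host, (as_regex, as_glob) in results.items():
    print(f"{host:22} regex={as_regex!s:5} glob={as_glob}")
```

Only the glob reading accepts "www-eu.yoyodyne.com", which is
probably what the original pattern was meant to catch.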

If I had to choose just one though, I'd prefer to use PCRE,
Perl-Compatible Regular Expressions.  They offer a richer, more
concise syntax than traditional regexes, such as \d instead of
[[:digit:]] or [0-9].

> --filter=[+|-][file|path|domain]:REGEXP
> 
> is it consistent? is it flawed? is there a more convenient one?

It seems like a good idea, but wouldn't actually provide the
regex-filtering features I'm hoping for unless there was a "raw"
type in addition to "file", "domain", etc.  I'll give details
below.  Basically, I need to match based on things like the
inline CSS data, the visible link text, etc.

> please notice that supporting multiple comma-separated regexp in a 
> single --filter option:
> 
> --filter=[+|-][file|path|domain]:REGEXP1,REGEXP2,...

Commas for multiple regexes are unnecessary.  Regexes already
have an "or" operator built in.  If you want to match "fee" or
"fie" or "foe" or "fum", the pattern is fee|fie|foe|fum.

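For instance, a small sketch with Python's re module (used here only
as a stand-in; its alternation syntax is the same as PCRE's for this
pattern).  The word "foo" is an extra non-matching input added for
illustration.

```python
import re

# One pattern covers all four alternatives; no comma-separated list needed.
giant = re.compile(r"fee|fie|foe|fum")

words = ["fee", "fie", "foe", "fum", "foo"]
matched = [w for w in words if giant.fullmatch(w)]
print(matched)
```
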
> we also have to reach consensus on the filtering algorithm. for
> instance, should we simply require that a url passes all the
> filtering rules to allow its download (just like the current
> -A/R behaviour), or should we instead adopt a short circuit
> algorithm that applies all rules in the same order in which
> they were given in the command line and immediately allows the
> download of an url if it passes the first "allow" match?

Regexes implicitly have "or" functionality built in, via the pipe
operator.  They also have "and" built in simply by extending the
pattern.  To require both "foo" and "bar" in a match, you could
do something like "foo.*bar|bar.*foo".  So, it's not strictly
necessary to support more than one regex unless you specify both
an include pattern and an exclude pattern.

However, if multiple patterns are supported, I think it would be
more helpful to combine them with "and" rather than "or".  This
is just because expressing "and" inside a single regex roughly
doubles its length, so it may be more convenient to say
"--filter=foo --filter=bar" than "--filter='foo.*bar|bar.*foo'".

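A rough sketch of the two approaches side by side, in Python.  The
helper passes_all() and the example URLs are hypothetical; this only
illustrates the "and" semantics, not how wget would implement them.

```python
import re

def passes_all(url, patterns):
    """Return True only if every pattern is found somewhere in url."""
    return all(re.search(p, url) for p in patterns)

urls = [
    "http://example.com/foo/bar.html",
    "http://example.com/bar/foo.html",
    "http://example.com/foo/index.html",
]

# "and" written inside one regex: both orders must be spelled out.
single = re.compile(r"foo.*bar|bar.*foo")
by_single = [bool(single.search(u)) for u in urls]

# "and" across two separate filters, as in --filter=foo --filter=bar.
by_multi = [passes_all(u, ["foo", "bar"]) for u in urls]

print(by_single, by_multi)
```

The two forms agree here, but they can differ when the matches
overlap: with patterns "ab" and "bc", the string "abc" passes two
separate filters yet does not match "ab.*bc|bc.*ab".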

Below is the original message I sent to the wget list a few
months ago, about this same topic:

=====
I'd find it useful to guide wget by using regular expressions to
control which links get followed.  For example, to avoid
following links based on embedded css styles or link text.

I've needed this several times, but the most recent was when I
wanted to avoid following any "add to cart" or "buy" links on a
site which uses GET parameters instead of directories to select
content.

Given a link like this...

<a 
href="http://www.foo.com/forums/gallery2.php?g2_controller=cart.AddToCart&amp;g2_itemId=11436&amp;g2_return=http%3A%2F%2Fwww.foo.com%2Fforums%2Fgallery2.php%3Fg2_view%3Dcore.ShowItem%26g2_itemId%3D11436%26g2_page%3D4%26g2_GALLERYSID%3D1d78fb5be7613cc31d33f7dfe7fbac7b&amp;g2_GALLERYSID=1d78fb5be7613cc31d33f7dfe7fbac7b&amp;g2_returnName=album";
 class="gbAdminLink gbAdminLink gbLink-cart_AddToCart">add to cart</a>

... a useful parameter could be --ignore-regex='AddToCart|add to cart'
so the class or link text (really, anything inside the tag) could
be used to decide whether the link should be followed.
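
A minimal sketch of that idea in Python, assuming the crawler has the
raw text of each tag available.  The first tag is a shortened version
of the link above; the second ("view item") is a made-up link that
should still be followed.

```python
import re

# The pattern proposed for --ignore-regex, applied to the whole tag.
IGNORE = re.compile(r"AddToCart|add to cart")

links = [
    '<a href="gallery2.php?g2_controller=cart.AddToCart&amp;g2_itemId=11436"'
    ' class="gbAdminLink gbLink-cart_AddToCart">add to cart</a>',
    # Hypothetical link, for illustration only.
    '<a href="gallery2.php?g2_view=core.ShowItem&amp;g2_itemId=11436">view item</a>',
]

# Keep only the tags in which the ignore pattern is not found.
followed = [tag for tag in links if not IGNORE.search(tag)]
print(followed)
```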

Or...  if there's already a way to do this, let me know.  I
didn't see anything in the docs, but I may have missed something.

:)
=====

I think what I want could be implemented via the --filter option,
with a few small modifications to what was proposed.  I'm not
sure exactly what syntax to use, but it should be able to specify
whether to include/exclude the link, which PCRE flags to use, how
much of the raw HTML tag to use as input, and what pattern to use
for matching.  Here's an idea:

  --filter=[allow][flags,][scope][:]pattern

Example:

  '--filter=-i,raw:add ?to ?cart'
  (the quotes are there only to make the shell treat it as one parameter)

The details are:

  "allow" is "+" for "include" or "-" for "exclude".
  It defaults to "+" if omitted.

  "flags," is a set of letters to control regex options, followed
  by a comma (to separate it from scope).  For example, "i"
  specifies a case-insensitive search.  These would be the same
  flags that perl appends to the end of search patterns.  So,
  instead of "/foo/i", it would be "--filter=+i,:foo"

  "scope" controls how much of the <a> or similar tag gets used
  as input to the regex.  Values include:
    raw: use the entire tag and all contents (default)
         <a href="/path/to/foo.ext">bar</a>
    domain: use only the domain name
         www.example.com
    file: use only the file name
         foo.ext
    path: use the directory, but not the file name
         /path/to
    others...  can be added as desired

  ":" is required if "allow" or "flags" or "scope" is given

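One possible way to parse this syntax, sketched in Python.  The names
Filter, SCOPES and parse_filter are hypothetical, and the fallback
rule (treat the whole argument as the pattern when the text before
the first ":" is not a valid prefix) is an assumption, since the
proposal does not say how a colon inside the pattern should be
handled.

```python
import re
from typing import NamedTuple

class Filter(NamedTuple):
    include: bool   # True for "+", False for "-"
    flags: str      # e.g. "i" for case-insensitive
    scope: str      # one of SCOPES
    pattern: str

SCOPES = {"raw", "domain", "file", "path"}

# [allow][flags,][scope], i.e. everything that may precede the ":".
_PREFIX = re.compile(r"([+-]?)(?:([a-z]*),)?([a-z]*)")

def parse_filter(spec: str) -> Filter:
    head, sep, tail = spec.partition(":")
    m = _PREFIX.fullmatch(head) if sep else None
    if m is None:
        # No colon, or an invalid prefix: the whole thing is the pattern.
        return Filter(True, "", "raw", spec)
    allow, flags, scope = m.groups()
    scope = scope or "raw"
    if scope not in SCOPES:
        # Assumption: an unknown scope (e.g. "http" in "http://x")
        # means the colon belongs to the pattern itself.
        return Filter(True, "", "raw", spec)
    return Filter(allow != "-", flags or "", scope, tail)

print(parse_filter("-i,raw:add ?to ?cart"))
print(parse_filter("+i,:foo"))
```
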
So, for example, to exclude the "add to cart" links in my
previous post, this could be used:

  --filter=-raw:'AddToCart|add to cart'
    or
  --filter=-raw:AddToCart\|add\ to\ cart
    or
  --filter=-:'AddToCart|add to cart'
    or
  --filter=-i,raw:'add ?to ?cart'
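
The first two forms differ only in shell quoting.  As a quick check,
Python's shlex module (which approximates POSIX shell word splitting)
turns both into the same single argument:

```python
import shlex

variants = [
    "--filter=-raw:'AddToCart|add to cart'",
    r"--filter=-raw:AddToCart\|add\ to\ cart",
]

# Each command-line fragment is split the way a POSIX shell would split it.
argv = [shlex.split(v) for v in variants]
for words in argv:
    print(words)
```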

Alternately, the --filter option could be split into two options:
one for including content, and one for excluding.  This would be
more consistent with wget's existing parameters, and would
slightly simplify the syntax.

I hope I haven't been too full of hot air.  This is a feature I've
wanted in wget for a long time, and I'm a bit excited that it
might happen soon.  :)


-- Scott
