Scott Scriven wrote:
* Mauro Tortonesi <[EMAIL PROTECTED]> wrote:

wget -r --filter=-domain:www-*.yoyodyne.com

This appears to match "www.yoyodyne.com", "www--.yoyodyne.com",
"www-------.yoyodyne.com", and so on, if interpreted as a regex.

not really. it would not match www.yoyodyne.com.

It would most likely also match "www---zyoyodyneXcom".

yes.

Perhaps you want glob patterns instead?  I know I wouldn't mind having
glob patterns in addition to regexes...  glob is much easier
when you're not doing complex matches.
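
To make the regex-versus-glob difference concrete, here is a minimal sketch that tests the pattern from the example both ways, including the disputed "www.yoyodyne.com" case. It uses Python's re and fnmatch modules as stand-ins; which regex engine wget would use, and whether it would anchor the match, is not settled in this thread, so whole-string matching is an assumption. The host "www-1.yoyodyne.com" is made up for illustration.

```python
import fnmatch
import re

PATTERN = "www-*.yoyodyne.com"

# Hosts mentioned in the thread, plus one hypothetical host
# ("www-1.yoyodyne.com") of the kind a glob user probably has in mind.
HOSTS = [
    "www.yoyodyne.com",
    "www--.yoyodyne.com",
    "www-------.yoyodyne.com",
    "www---zyoyodyneXcom",
    "www-1.yoyodyne.com",
]


def as_regex(host: str) -> bool:
    """Read PATTERN as a regex: '-*' is zero or more dashes, '.' is any character."""
    # Assumption: the whole host name must match (anchored at both ends).
    return re.fullmatch(PATTERN, host) is not None


def as_glob(host: str) -> bool:
    """Read PATTERN as a shell glob: '*' is any run of characters, '.' is literal."""
    return fnmatch.fnmatchcase(host, PATTERN)


for host in HOSTS:
    print(f"{host:26} regex={as_regex(host)!s:5} glob={as_glob(host)}")
```

Read as a regex, the pattern accepts "www.yoyodyne.com" (zero dashes, and the "." matches the literal dot) and rejects "www-1.yoyodyne.com"; read as a glob, it is the other way around.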

no. i was talking about regexps. they are more expressive and powerful than simple globs. i don't see what's the point in supporting both.

If I had to choose just one though, I'd prefer to use PCRE,
Perl-Compatible Regular Expressions.  They offer a richer, more
concise syntax than traditional regexes, such as \d instead of
[:digit:] or [0-9].

i agree, but adding a PCRE dependency to wget is asking for infinite maintenance nightmares. and i don't know if we can simply bundle code from PCRE in wget, as it has a BSD license.

--filter=[+|-][file|path|domain]:REGEXP

is it consistent? is it flawed? is there a more convenient one?

It seems like a good idea, but wouldn't actually provide the
regex-filtering features I'm hoping for unless there was a "raw"
type in addition to "file", "domain", etc.  I'll give details
below.  Basically, I need to match based on things like the
inline CSS data, the visible link text, etc.

do you mean you would like to have a regex class working on the content of downloaded files as well?

Below is the original message I sent to the wget list a few
months ago, about this same topic:

=====
I'd find it useful to guide wget by using regular expressions to
control which links get followed.  For example, to avoid
following links based on embedded css styles or link text.

I've needed this several times, but the most recent was when I
wanted to avoid following any "add to cart" or "buy" links on a
site which uses GET parameters instead of directories to select
content.

Given a link like this...

<a 
href="http://www.foo.com/forums/gallery2.php?g2_controller=cart.AddToCart&amp;g2_itemId=11436&amp;g2_return=http%3A%2F%2Fwww.foo.com%2Fforums%2Fgallery2.php%3Fg2_view%3Dcore.ShowItem%26g2_itemId%3D11436%26g2_page%3D4%26g2_GALLERYSID%3D1d78fb5be7613cc31d33f7dfe7fbac7b&amp;g2_GALLERYSID=1d78fb5be7613cc31d33f7dfe7fbac7b&amp;g2_returnName=album"
 class="gbAdminLink gbAdminLink gbLink-cart_AddToCart">add to cart</a>

... a useful parameter could be --ignore-regex='AddToCart|add to cart'
so the class or link text (really, anything inside the tag) could
be used to decide whether the link should be followed.

Or...  if there's already a way to do this, let me know.  I
didn't see anything in the docs, but I may have missed something.

:)
=====

I think what I want could be implemented via the --filter option,
with a few small modifications to what was proposed.  I'm not
sure exactly what syntax to use, but it should be able to specify
whether to include/exclude the link, which PCRE flags to use, how
much of the raw HTML tag to use as input, and what pattern to use
for matching.  Here's an idea:

  --filter=[allow][flags,][scope][:]pattern

Example:

  '--filter=-i,raw:add ?to ?cart'
  (the quotes are there only to make the shell treat it as one parameter)

The details are:

  "allow" is "+" for "include" or "-" for "exclude".
  It defaults to "+" if omitted.

  "flags," is a set of letters to control regex options, followed
  by a comma (to separate it from scope).  For example, "i"
  specifies a case-insensitive search.  These would be the same
  flags that perl appends to the end of search patterns.  So,
  instead of "/foo/i", it would be "--filter=+i,:foo"

  "scope" controls how much of the <a> or similar tag gets used
  as input to the regex.  Values include:
    raw: use the entire tag and all contents (default)
         <a href="/path/to/foo.ext">bar</a>
    domain: use only the domain name
         www.example.com
    file: use only the file name
         foo.ext
    path: use the directory, but not the file name
         /path/to
    others...  can be added as desired

  ":" is required if "allow" or "flags" or "scope" is given

So, for example, to exclude the "add to cart" links in my
previous post, this could be used:

  --filter=-raw:'AddToCart|add to cart'
    or
  --filter=-raw:AddToCart\|add\ to\ cart
    or
  --filter=-:'AddToCart|add to cart'
    or
  --filter=-i,raw:'add ?to ?cart'
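
For illustration only, the proposed syntax could be parsed roughly as follows. This is a sketch in Python, not wget code: the names parse_filter, scope_input and filter_hits are hypothetical, Python's re module stands in for PCRE, and the flag letters beyond "i" are assumed to map onto their Perl namesakes. A spec with no ":" prefix is treated as a bare pattern with the defaults ("+" and "raw"); how several filters should combine is left open, as in the proposal.

```python
import posixpath
import re
from typing import NamedTuple
from urllib.parse import urlsplit

# [allow][flags,][scope]:pattern  (the ":" is required when any prefix is given)
SPEC = re.compile(
    r"^(?P<allow>[+-])?"
    r"(?:(?P<flags>[a-z]*),)?"
    r"(?P<scope>raw|domain|file|path)?"
    r":(?P<pattern>.*)$",
    re.DOTALL,
)

# Assumed mapping from Perl-style flag letters to Python's re flags.
FLAG_BITS = {"i": re.IGNORECASE, "m": re.MULTILINE, "s": re.DOTALL, "x": re.VERBOSE}


class Filter(NamedTuple):
    include: bool       # True for "+", False for "-"
    scope: str          # "raw", "domain", "file" or "path"
    regex: re.Pattern   # compiled pattern


def parse_filter(spec: str) -> Filter:
    """Parse one --filter value; a spec without a prefix is a bare pattern."""
    m = SPEC.match(spec)
    if m is None:
        return Filter(True, "raw", re.compile(spec))
    bits = 0
    for letter in m.group("flags") or "":
        if letter not in FLAG_BITS:
            raise ValueError(f"unknown flag {letter!r} in filter {spec!r}")
        bits |= FLAG_BITS[letter]
    return Filter(
        include=m.group("allow") != "-",
        scope=m.group("scope") or "raw",
        regex=re.compile(m.group("pattern"), bits),
    )


def scope_input(scope: str, raw_tag: str, url: str) -> str:
    """Pick the text the regex is applied to, following the scope list above."""
    parts = urlsplit(url)
    if scope == "domain":
        return parts.hostname or ""
    if scope == "file":
        return posixpath.basename(parts.path)
    if scope == "path":
        return posixpath.dirname(parts.path)
    return raw_tag  # "raw": the entire tag and its contents


def filter_hits(f: Filter, raw_tag: str, url: str) -> bool:
    """True if the filter's pattern is found anywhere in its scope's input."""
    return f.regex.search(scope_input(f.scope, raw_tag, url)) is not None


# Shortened, made-up version of the "add to cart" link from the earlier message.
TAG = '<a href="/forums/gallery2.php?g2_controller=cart.AddToCart">add to cart</a>'
URL = "http://www.foo.com/forums/gallery2.php?g2_controller=cart.AddToCart"

f = parse_filter("-i,raw:add ?to ?cart")
print(f.include, f.scope, filter_hits(f, TAG, URL))
```

One ambiguity this sketch does not resolve: a bare pattern that happens to begin with something like "i,:" or "path:" would be read as a prefix, so a real implementation would need an escaping rule or a mandatory ":".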

Alternately, the --filter option could be split into two options:
one for including content, and one for excluding.  This would be
more consistent with wget's existing parameters, and would
slightly simplify the syntax.

I hope I haven't been too full of hot air.  This is a feature I've
wanted in wget for a long time, and I'm a bit excited that it
might happen soon.  :)

i don't like your "raw" proposal as it is HTML-specific. i would like instead to develop a mechanism which could work for all supported protocols.

--
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi                          http://www.tortonesi.com

University of Ferrara - Dept. of Eng.    http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linux            http://www.deepspace6.net
Ferrara Linux User Group                 http://www.ferrara.linux.it
