Re: regex support RFC

Scott Scriven Fri, 31 Mar 2006 10:47:18 -0800

* Mauro Tortonesi <[EMAIL PROTECTED]> wrote:
>> I'm hoping for ... a "raw" type in addition to "file",
>> "domain", etc.
> 
> do you mean you would like to have a regex class working on the
> content of downloaded files as well?


Not exactly.  (details below)

> i don't like your "raw" proposal as it is HTML-specific. i
> would like instead to develop a mechanism which could work for
> all supported protocols.

I see.  It would be problematic for other protocols.  :(
A raw match would be more complicated than I originally thought,
because it is HTML-specific and uses extra data which isn't
currently available to the filters.

Would it be feasible to make "raw" simply return the full URI
when the document is not HTML?

I think there is some value in matching based on the entire link
tag, instead of just the URI.  Wget already has --follow-tags and
--ignore-tags, and a "raw" match would be like an extension to
that concept.  I would find it useful to be able to filter
according to things which are not part of the URI.  For example:

  follow: <a href="/a38bef9c" class="content">article</a>
  skip:   <a href="/cb31d512" class="advertisement">buy now</a>

Either the class attribute or the visible link text could be used
to decide if the link is worth following, but the URI in this
case is pretty useless.
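
To make the idea concrete, here is a rough sketch in Python (not
wget's C, and not a proposed implementation).  LinkCollector and
filter_links are made-up names for illustration only.  It collects
each <a> element as raw text and rejects a link when a regex
matches that raw text, so either the class attribute or the link
text can decide:

```python
import re
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: record each <a> element as raw text plus href."""

    def __init__(self):
        super().__init__()
        self.links = []       # finished links: dicts with "raw" and "href"
        self._current = None  # the <a> element currently being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._current = {
                # get_starttag_text() returns the start tag as written
                "raw": self.get_starttag_text(),
                "href": dict(attrs).get("href"),
                "text": "",
            }

    def handle_data(self, data):
        # Only text inside an open <a> element counts as link text.
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            link = self._current
            link["raw"] += link["text"] + "</a>"
            self.links.append(link)
            self._current = None


def filter_links(html, reject_pattern):
    """Return hrefs of links whose raw tag text does NOT match the regex."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    reject = re.compile(reject_pattern)
    return [
        link["href"]
        for link in parser.links
        if link["href"] and not reject.search(link["raw"])
    ]
```

With the two links above as input, a reject pattern of
'class="advertisement"' or '>buy now<' should leave only
/a38bef9c in the list.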

It may need to be a separate option: "--filter" would filter the
URI list, and "--filter-tag" would run earlier in the process (at
the same place as "--follow-tags") to help generate the URI list.
Regardless, I think it would be useful.
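
A minimal sketch of how the two stages might fit together, again
in Python and purely illustrative.  The function names, the
(raw_tag, uri) pairs and the example.org URIs are assumptions, and
neither option exists in wget today.  Links from non-HTML sources
carry no tag, so the tag stage falls back to matching the full
URI:

```python
import re


def build_uri_list(links, tag_reject=None):
    """Stage 1 (hypothetical --filter-tag): drop links whose raw tag matches.

    `links` is a list of (raw_tag, uri) pairs; raw_tag is None when the
    link did not come from an HTML document.
    """
    tag_re = re.compile(tag_reject) if tag_reject else None
    uris = []
    for raw_tag, uri in links:
        # No tag available (non-HTML source): match against the full URI.
        subject = raw_tag if raw_tag is not None else uri
        if tag_re and tag_re.search(subject):
            continue
        uris.append(uri)
    return uris


def filter_uris(uris, uri_reject=None):
    """Stage 2 (hypothetical --filter): drop URIs matching the pattern."""
    if not uri_reject:
        return list(uris)
    uri_re = re.compile(uri_reject)
    return [u for u in uris if not uri_re.search(u)]


links = [
    ('<a href="/a38bef9c" class="content">article</a>', "/a38bef9c"),
    ('<a href="/cb31d512" class="advertisement">buy now</a>', "/cb31d512"),
    (None, "ftp://example.org/pub/ads.tar.gz"),
    (None, "ftp://example.org/pub/notes.txt"),
]
stage1 = build_uri_list(links, tag_reject=r'class="advertisement"')
stage2 = filter_uris(stage1, uri_reject=r"\.tar\.gz$")
```

The tag stage only sees what the HTML parser sees, and the URI
stage stays protocol-independent.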

Any thoughts?


-- Scott
