* Mauro Tortonesi <[EMAIL PROTECTED]> wrote:
>> I'm hoping for ... a "raw" type in addition to "file",
>> "domain", etc.
>
> do you mean you would like to have a regex class working on the
> content of downloaded files as well?
Not exactly. (details below)

> i don't like your "raw" proposal as it is HTML-specific. i
> would like instead to develop a mechanism which could work for
> all supported protocols.

I see. It would be problematic for other protocols. :(

A raw match would be more complicated than I originally thought,
because it is HTML-specific and uses extra data which isn't
currently available to the filters.

Would it be feasible to make "raw" simply return the full URI when
the document is not HTML?

I think there is some value in matching based on the entire link
tag, instead of just the URI. Wget already has --follow-tags and
--ignore-tags, and a "raw" match would be like an extension to that
concept. I would find it useful to be able to filter according to
things which are not part of the URI. For example:

  follow: <a href="/a38bef9c" class="content">article</a>
  skip:   <a href="/cb31d512" class="advertisement">buy now</a>

Either the class property or the visible link text could be used
to decide if the link is worth following, but the URI in this case
is pretty useless.

It may need to be a different option; use "--filter" to filter the
URI list, and use "--filter-tag" earlier in the process (same place
as "--follow-tags"), to help generate the URI list.

Regardless, I think it would be useful. Any thoughts?

-- 
Scott
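
P.S. To make the tag-level idea concrete, here is a rough sketch in
Python. It is not Wget code and not a proposed implementation. It
assumes a hypothetical filter that applies a regex to the whole <a>
element (the raw start tag plus the visible link text) instead of the
URI alone. The names TagFilter and filter_links are made up for
illustration.

```python
import re
from html.parser import HTMLParser


class TagFilter(HTMLParser):
    """Hypothetical sketch: keep hrefs of <a> elements whose raw tag
    text (start tag + link text + "</a>") matches a regex."""

    def __init__(self, pattern):
        super().__init__()
        self.pattern = re.compile(pattern)
        self.followed = []     # hrefs that passed the filter
        self._current = None   # (href, raw start tag) of the open <a>
        self._text = []        # visible text collected inside the <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                # get_starttag_text() returns the start tag as written
                self._current = (href, self.get_starttag_text())
                self._text = []

    def handle_data(self, data):
        if self._current is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            href, start = self._current
            raw = start + "".join(self._text) + "</a>"
            if self.pattern.search(raw):
                self.followed.append(href)
            self._current = None


def filter_links(html, pattern):
    """Return the hrefs whose whole link element matches `pattern`."""
    parser = TagFilter(pattern)
    parser.feed(html)
    parser.close()
    return parser.followed


html = ('<a href="/a38bef9c" class="content">article</a> '
        '<a href="/cb31d512" class="advertisement">buy now</a>')

by_class = filter_links(html, r'class="content"')
by_text = filter_links(html, r'>article<')
```

With the two example links, matching on either the class or the link
text keeps /a38bef9c and drops /cb31d512, which a URI-only filter
could not do.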