Re: regex support RFC
Curtis Hatter wrote:
> On Friday 31 March 2006 06:52, Mauro Tortonesi wrote:
> > while i like the idea of supporting modifiers like "quick" (short
> > circuit) and maybe "i" (case insensitive comparison), i think that
> > (?i:) and (?-i:) constructs would be overkill and rather hard to
> > implement.
>
> I figured that the (?i:) and (?-i:) constructs would be provided by the
> regular expression engine and that the --filter switch would simply be
> able to use any construct provided by that engine.

i know, that would be really nice.

> If, as you said, this would be hard to implement or require extra
> effort by you that is above and beyond that required for the more
> "standard" constructs then I would say that they shouldn't be
> implemented; at least at first.

i agree.

-- 
Aequam memento rebus in arduis servare mentem...

Mauro Tortonesi                          http://www.tortonesi.com

University of Ferrara - Dept. of Eng.    http://www.ing.unife.it
GNU Wget - HTTP/FTP file retrieval tool  http://www.gnu.org/software/wget
Deep Space 6 - IPv6 for Linux            http://www.deepspace6.net
Ferrara Linux User Group                 http://www.ferrara.linux.it
Re: regex support RFC
Hrvoje Niksic wrote:
> "Tony Lewis" <[EMAIL PROTECTED]> writes:
> > I don't think ",r" complicates the command that much. Internally, the
> > only additional work for supporting both globs and regular expressions
> > is a function that converts a glob into a regexp when ",r" is not
> > requested. That's a straightforward transformation.
>
> ",r" makes it harder to input regexps, which are the whole point of
> introducing --filter. Besides, having two different syntaxes for the
> same switch, and for no good reason, is not really acceptable, even if
> the implementation is straightforward.

i agree 100%. and don't forget that globs are already supported by
current filtering options.

-- 
Mauro Tortonesi                          http://www.tortonesi.com
Re: regex support RFC
Tony Lewis wrote:
> Hrvoje Niksic wrote:
> > I don't see a clear line that connects --filter to glob patterns as
> > used by the shell.
>
> I want to list all PDFs in the shell:
>
>    ls -l *.pdf
>
> I want a filter to keep all PDFs:
>
>    --filter=+file:*.pdf

you don't need --filter for that. you can simply use -A.

> I predict that the vast majority of bug reports and support requests
> will be for users who are trying a glob rather than a regular
> expression.

you might be right about this. but i think your point of view is
somewhat flawed. hrvoje and i designed --filter to extend current wget
filtering capabilities, not to replace them. in this sense, --filter
should be used only when regex filtering capabilities are needed. if
not, -A/R & company are just fine.

-- 
Mauro Tortonesi                          http://www.tortonesi.com
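(An aside, to make the -A point concrete: the PDF case above already
works today with wget's accept list, which takes comma-separated
suffixes or shell-style patterns. The URL below is just a placeholder:

   wget -r -A '*.pdf' http://www.example.com/

so a regexp-capable --filter genuinely adds nothing for this particular
job.)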
RE: regex support RFC
I agree with Tony. I think most basic users, me included, thought
www-*.yoyodyne.com would not match www.yoyodyne.com. Support globs as
the default, regexp as the more powerful option.

Ranjit Sandhu
SRA

-----Original Message-----
From: Tony Lewis [mailto:[EMAIL PROTECTED]
Sent: Friday, March 31, 2006 10:03 AM
To: wget@sunsite.dk
Subject: RE: regex support RFC

Mauro Tortonesi wrote:
> no. i was talking about regexps. they are more expressive and powerful
> than simple globs. i don't see what's the point in supporting both.

The problem is that users who are expecting globs will try things like
--filter=-file:*.pdf rather than --filter:-file:.*\.pdf. In many cases
their expressions will simply work, which will result in significant
confusion when some expression doesn't work, such as
--filter:-domain:www-*.yoyodyne.com. :-)

It is pretty easy to programmatically convert a glob into a regular
expression. One possibility is to make glob the default input and allow
regular expressions. For example, the following could be equivalent:

   --filter:-domain:www-*.yoyodyne.com
   --filter:-domain,r:www-.*\.yoyodyne\.com

Internally, wget would convert the first into the second and then treat
it as a regular expression. For the vast majority of cases, glob will
work just fine.

One might argue that it's a lot of work to implement regular expressions
if the default input format is a glob, but I think we should aim for
both lack of confusion and robust functionality. Using ",r" means people
get regular expressions when they want them and know what they're doing.
The universe of wget users who "know what they're doing" are mostly
subscribed to this mailing list; the rest of them send us mail saying
"please CC me as I'm not on the list". :-)

If we go this route, I'm wondering if the appropriate conversion from
glob to regular expression should take directory separators into
account, such as:

   --filter:-path:path/to/*

becoming the same as:

   --filter:-path,r:path/to/[^/]*

or even:

   --filter:-path,r:path[/\\]to[/\\][^/\\]*

Should the glob match "path/to/sub/dir"? (I suspect it shouldn't.)

Tony
Re: regex support RFC
* Mauro Tortonesi <[EMAIL PROTECTED]> wrote:
> > I'm hoping for ... a "raw" type in addition to "file",
> > "domain", etc.
>
> do you mean you would like to have a regex class working on the
> content of downloaded files as well?

Not exactly. (details below)

> i don't like your "raw" proposal as it is HTML-specific. i
> would like instead to develop a mechanism which could work for
> all supported protocols.

I see. It would be problematic for other protocols. :(

A raw match would be more complicated than I originally thought, because
it is HTML-specific and uses extra data which isn't currently available
to the filters. Would it be feasible to make "raw" simply return the
full URI when the document is not HTML?

I think there is some value in matching based on the entire link tag,
instead of just the URI. Wget already has --follow-tags and
--ignore-tags, and a "raw" match would be like an extension to that
concept. I would find it useful to be able to filter according to things
which are not part of the URI. For example:

  follow: a link whose class or visible text says "article"
  skip:   one whose class or visible text says "buy now"

Either the class property or the visible link text could be used to
decide if the link is worth following, but the URI in this case is
pretty useless.

It may need to be a different option; use "--filter" to filter the URI
list, and use "--filter-tag" earlier in the process (same place as
"--follow-tags"), to help generate the URI list.

Regardless, I think it would be useful. Any thoughts?

-- Scott
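To make the "raw" scope concrete, this is roughly what such a match
would look at: the whole tag rather than just the URI. A small Perl
illustration (the tag is adapted from the gallery2 example later in the
thread; nothing here is actual wget code):

  my $tag = '<a href="/gallery2.php?g2_controller=cart.AddToCart" '
          . 'class="gbLink-cart_AddToCart">add to cart</a>';
  print "skip this link\n"
      if $tag =~ /AddToCart|add ?to ?cart/i;  # matches class and link text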
Re: regex support RFC
* [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
> > wget -e robots=off -r -N -k -E -p -H http://www.gnu.org/software/wget/
> >
> > soon leads to non wget related links being downloaded, eg.
> > http://www.gnu.org/graphics/agnuhead.html
>
> In that particular case, I think --no-parent would solve the problem.

No. The idea is not to be restricted to not descending the tree.

> Maybe I misunderstood, though. It seems awfully risky to use -r and -H
> without having something to strictly limit the links followed. So, I
> suppose the content filter would be an effective way to make cross-host
> downloading safer.

Absolutely. That is why I proposed a 'contents' regexp.

> I think I'd prefer to have a different option, for that sort of thing
> -- filter by using external programs. If the program returns a specific
> code, follow the link or recurse into the links contained in the file.
> Then you could do far more complex filtering, including things like
> interactive pruning.

True. That could be a future feature request, but now that the wget team
are writing regexp code, it seems an ideal time to implement it. By
constructing suitable regexps, one could use this feature to search for
any string in the html file (as above), or just in metatags etc. IMHO it
gives a lot of flexibility for little extra developer programming.

Any comments, Mauro & Hrvoje?

Thanks
Tom Crane
-- 
Tom Crane, Dept. Physics, Royal Holloway, University of London,
Egham Hill, Egham, Surrey, TW20 0EX, England.
Email: [EMAIL PROTECTED]
Fax: +44 (0) 1784 472794
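For what the external-program idea might look like in practice, here is
a minimal sketch (purely hypothetical; wget has no such hook today):
wget would run a user-supplied predicate once per candidate link and
follow it only on exit status 0.

  #!/usr/bin/perl
  # follow-filter: exit 0 = follow the link, exit 1 = skip it.
  # a hypothetical wget would invoke it as: follow-filter URL
  my $url = shift @ARGV or exit 1;
  exit($url =~ m{gnu\.org/software/wget/} ? 0 : 1);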
Re: regex support RFC
On Friday 31 March 2006 06:52, Mauro Tortonesi wrote:
> while i like the idea of supporting modifiers like "quick" (short
> circuit) and maybe "i" (case insensitive comparison), i think that (?i:)
> and (?-i:) constructs would be overkill and rather hard to implement.

I figured that the (?i:) and (?-i:) constructs would be provided by the
regular expression engine and that the --filter switch would simply be
able to use any construct provided by that engine. I was more trying to
persuade for the use of a regex engine that supports such constructs
(like Perl's).

Some other constructs I find useful are the lookaround assertions:
(?=...), (?!...), (?<=...), (?<!...)

These may be overkill, but I would rather have the expressiveness of a
regex engine like Perl's when I need it instead of writing regexes in
another engine that have to be twice as long to compensate for the lack
of language constructs. Those who don't want to use them, or don't know
of them, can write regexes as normal.

If, as you said, this would be hard to implement or require extra effort
by you that is above and beyond that required for the more "standard"
constructs then I would say that they shouldn't be implemented; at least
at first.

Curtis
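For readers unfamiliar with the lookaround constructs Curtis mentions, a
quick Perl illustration (not wget syntax):

  "foobar" =~ /foo(?=bar)/;    # matches: "foo" followed by "bar"
  "foobaz" =~ /foo(?!bar)/;    # matches: "foo" not followed by "bar"
  "xfoo"   =~ /(?<=x)foo/;     # matches: "foo" preceded by "x"
  "yfoo"   =~ /(?<!x)foo/;     # matches: "foo" not preceded by "x"

The point is that the lookaround part constrains the match without
consuming any text.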
RE: regex support RFC
Hrvoje Niksic wrote:
> I don't see a clear line that connects --filter to glob patterns as used
> by the shell.

I want to list all PDFs in the shell:

   ls -l *.pdf

I want a filter to keep all PDFs:

   --filter=+file:*.pdf

Note that "*.pdf" is not a valid regular expression even though it's
what most people will try naturally. Perl complains:

   /*.pdf/: ?+*{} follows nothing in regexp

I predict that the vast majority of bug reports and support requests
will be for users who are trying a glob rather than a regular
expression.

Tony
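A quick Perl check of the difference (the regexp spelling of the same
intent would be something along the lines of \.pdf$):

  # "report.pdf" =~ /*.pdf/;                  # won't even compile
  print "ok\n" if "report.pdf" =~ /\.pdf$/;   # matches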
Re: regex support RFC
"Tony Lewis" <[EMAIL PROTECTED]> writes: > I didn't miss the point at all. I'm trying to make a completely different > one, which is that regular expressions will confuse most users (even if you > tell them that the argument to --filter is a regular expression). Well, "most users" will probably not use --filter in the first place. Those that do will have to look at the documentation where they'll find that it accepts regexps. Since Wget is hardly the first program to use regexps, I don't see why most users would be confused by that choice. > Yes, regular expressions are used elsewhere on Unix, but not > everywhere. The shell is the most obvious comparison for user input > dealing with expressions that select multiple objects; the shell > uses globs. I don't see a clear line that connects --filter to glob patterns as used by the shell. If anything, the connection is with grep and other commands that provide powerful filtering (awk and Perl's // operators), which all seem to work on regexps. Where the context can be thought of shell-like (as in wget ftp://blah/*), Wget happily obliges by providing shell-compatible patterns. > I don't think ",r" complicates the command that much. Internally, > the only additional work for supporting both globs and regular > expressions is a function that converts a glob into a regexp when > ",r" is not requested. That's a straightforward transformation. ",r" makes it harder to input regexps, which are the whole point of introducing --filter. Besides, having two different syntaxes for the same switch, and for no good reason, is not really acceptable, even if the implementation is straightforward.
RE: regex support RFC
Hrvoje Niksic wrote:
> But that misses the point, which is that we *want* to make the
> more expressive language, already used elsewhere on Unix, the
> default.

I didn't miss the point at all. I'm trying to make a completely
different one, which is that regular expressions will confuse most users
(even if you tell them that the argument to --filter is a regular
expression). This mailing list will get a huge number of bug reports
when users try to use globs that fail.

Yes, regular expressions are used elsewhere on Unix, but not everywhere.
The shell is the most obvious comparison for user input dealing with
expressions that select multiple objects; the shell uses globs.

Personally, I will be quite happy if --filter only supports regular
expressions because I've been using them quite effectively for years. I
just don't think the same thing can be said for the typical wget user.
We've already had disagreements in this chain about what would match a
particular regular expression; I suspect everyone involved in the
conversation could have correctly predicted what the equivalent glob
would do.

I don't think ",r" complicates the command that much. Internally, the
only additional work for supporting both globs and regular expressions
is a function that converts a glob into a regexp when ",r" is not
requested. That's a straightforward transformation.

Tony
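For what it's worth, the transformation really is small. A minimal Perl
sketch (glob_to_regexp is a name made up here, and this ignores [...]
character classes):

  # turn a glob into an anchored regexp: '*' -> '.*', '?' -> '.',
  # everything else matched literally
  sub glob_to_regexp {
      my ($glob) = @_;
      my $re = join '', map {
          $_ eq '*' ? '.*'
        : $_ eq '?' ? '.'
        : quotemeta($_)
      } split //, $glob;
      return qr/^$re$/;
  }

  # "www-*.yoyodyne.com" becomes qr/^www\-.*\.yoyodyne\.com$/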
Re: regex support RFC
"Tony Lewis" <[EMAIL PROTECTED]> writes: > Mauro Tortonesi wrote: > >> no. i was talking about regexps. they are more expressive >> and powerful than simple globs. i don't see what's the >> point in supporting both. > > The problem is that users who are expecting globs will try things like > --filter=-file:*.pdf The --filter command will be documented from the start to support regexps. Since most Unix utilities work with regexps and very few with globs (excepting the shell), this should not be a problem. > It is pretty easy to programmatically convert a glob into a regular > expression. But it's harder to document and explain, and it requires more options and logic. Supporting two different syntaxes (the old one for backward compatibility) is bad enough: supporting three is at least one too many. > One possibility is to make glob the default input and allow regular > expressions. For example, the following could be equivalent: > > --filter:-domain:www-*.yoyodyne.com > --filter:-domain,r:www-.*\.yoyodyne\.com But that misses the point, which is that we *want* to make the more expressive language, already used elsewhere on Unix, the default.
RE: regex support RFC
Mauro Tortonesi wrote:
> no. i was talking about regexps. they are more expressive
> and powerful than simple globs. i don't see what's the
> point in supporting both.

The problem is that users who are expecting globs will try things like
--filter=-file:*.pdf rather than --filter:-file:.*\.pdf. In many cases
their expressions will simply work, which will result in significant
confusion when some expression doesn't work, such as
--filter:-domain:www-*.yoyodyne.com. :-)

It is pretty easy to programmatically convert a glob into a regular
expression. One possibility is to make glob the default input and allow
regular expressions. For example, the following could be equivalent:

   --filter:-domain:www-*.yoyodyne.com
   --filter:-domain,r:www-.*\.yoyodyne\.com

Internally, wget would convert the first into the second and then treat
it as a regular expression. For the vast majority of cases, glob will
work just fine.

One might argue that it's a lot of work to implement regular expressions
if the default input format is a glob, but I think we should aim for
both lack of confusion and robust functionality. Using ",r" means people
get regular expressions when they want them and know what they're doing.
The universe of wget users who "know what they're doing" are mostly
subscribed to this mailing list; the rest of them send us mail saying
"please CC me as I'm not on the list". :-)

If we go this route, I'm wondering if the appropriate conversion from
glob to regular expression should take directory separators into
account, such as:

   --filter:-path:path/to/*

becoming the same as:

   --filter:-path,r:path/to/[^/]*

or even:

   --filter:-path,r:path[/\\]to[/\\][^/\\]*

Should the glob match "path/to/sub/dir"? (I suspect it shouldn't.)

Tony
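To answer the closing question concretely, the [^/]* translation is
exactly what keeps a glob from crossing directory boundaries. A Perl
check (anchors added for illustration):

  "path/to/file"    =~ m{^path/to/[^/]*$};  # matches
  "path/to/sub/dir" =~ m{^path/to/[^/]*$};  # no match: [^/]* stops at "/"
  "path/to/sub/dir" =~ m{^path/to/.*$};     # matches under a naive '*' -> '.*'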
Re: regex support RFC
Mauro Tortonesi wrote:
> for consistency and to avoid maintenance problems, i would like wget to
> have the same behavior on windows and unix. please, notice that if we
> implemented regex support only on unix, windows binaries of wget built
> with cygwin would have regex support but native binaries wouldn't. that
> would be very confusing for windows users, IMHO.

Ok, I understand. I was thinking of an #ifdef in the source code, so
you can:
- enable all regex code/command line parameters on Unix/Linux
- at runtime, print the error "regex not yet supported on windows" if
  any regex-related command line parameter is passed to wget on
  windows/cygwin

> this is planned for wget 1.12 (which might become 2.0). i already have
> some code implementing the connection cache data structure.

Excellent!

> > URL regex
>
> this is planned for wget 1.11. i've already started working on it.

looking forward to it, many thanks!

-- 
Oliver Schulze L. <[EMAIL PROTECTED]>
Re: regex support RFC
Hrvoje Niksic wrote:
> Wincent Colaiuta <[EMAIL PROTECTED]> writes:
> > Are you sure that "www-*" matches "www"?
>
> Yes.

hrvoje is right. try this perl script:

#!/usr/bin/perl -w
use strict;

my @strings = ("www-.yoyodyne.com", "www.yoyodyne.com");

foreach my $str (@strings) {
    $str =~ /www-*.yoyodyne.com/
        or print "$str doesn't match\n";
}

both the strings match.

-- 
Mauro Tortonesi                          http://www.tortonesi.com
Re: regex support RFC
Wincent Colaiuta <[EMAIL PROTECTED]> writes: > Are you sure that "www-*" matches "www"? Yes. > As far as I know "www-*" matches "one w, another w, a third w, a > hyphen, then 0 or more hyphens". That would be "www--*" or "www-+".
Re: regex support RFC
On 31/03/2006, at 14:37, Hrvoje Niksic wrote:
> "*" matches the previous character repeated 0 or more times. This is
> in contrast to wildcards, where "*" alone matches any character 0 or
> more times. (This is part of why regexps are often confusing to people
> used to the much simpler wildcards.) Therefore "www-*" matches "www",
> "www-", "www--", etc., i.e. Scott's interpretation was correct. What
> you describe is achieved with the "www-.*.yoyodyne.com".

Are you sure that "www-*" matches "www"? As far as I know "www-*"
matches "one w, another w, a third w, a hyphen, then 0 or more hyphens".
In other words, "www" does not match.

Wincent
Re: regex support RFC
Hrvoje Niksic wrote:
> Herold Heiko <[EMAIL PROTECTED]> writes:
> > Get the best of both, use a syntax permitting a "first match-exits"
> > ACL, single ACE permits several statements ANDed together.
>
> Cooking up a simple syntax for users without much regexp experience
> won't be easy. I assume ACL stands for "access control list", but what
> is ACE?

access control entry, i guess.

-- 
Mauro Tortonesi                          http://www.tortonesi.com
Re: regex support RFC
Hrvoje Niksic wrote:
> Mauro Tortonesi <[EMAIL PROTECTED]> writes:
> > > > > > wget -r --filter=-domain:www-*.yoyodyne.com
> > > > >
> > > > > This appears to match "www.yoyodyne.com", "www--.yoyodyne.com",
> > > > > "www---.yoyodyne.com", and so on, if interpreted as a regex.
> > > >
> > > > not really. it would not match www.yoyodyne.com.
> > >
> > > Why not?
> >
> > i may be wrong, but if - is not a special character, the previous
> > expression should match only domains starting with www- and ending
> > in [randomchar]yoyodyne[randomchar]com.
>
> "*" matches the previous character repeated 0 or more times. This is
> in contrast to wildcards, where "*" alone matches any character 0 or
> more times. (This is part of why regexps are often confusing to people
> used to the much simpler wildcards.) Therefore "www-*" matches "www",
> "www-", "www--", etc., i.e. Scott's interpretation was correct. What
> you describe is achieved with the "www-.*.yoyodyne.com".

you're right. ok, it is official. i must stop drinking this much - it
just doesn't work. i have to start drinking less or, even better, more.

-- 
Mauro Tortonesi                          http://www.tortonesi.com
Re: regex support RFC
Mauro Tortonesi <[EMAIL PROTECTED]> writes:
> > > > > wget -r --filter=-domain:www-*.yoyodyne.com
> > > >
> > > > This appears to match "www.yoyodyne.com", "www--.yoyodyne.com",
> > > > "www---.yoyodyne.com", and so on, if interpreted as a regex.
> > >
> > > not really. it would not match www.yoyodyne.com.
> >
> > Why not?
>
> i may be wrong, but if - is not a special character, the previous
> expression should match only domains starting with www- and ending
> in [randomchar]yoyodyne[randomchar]com.

"*" matches the previous character repeated 0 or more times. This is in
contrast to wildcards, where "*" alone matches any character 0 or more
times. (This is part of why regexps are often confusing to people used
to the much simpler wildcards.) Therefore "www-*" matches "www", "www-",
"www--", etc., i.e. Scott's interpretation was correct. What you
describe is achieved with the "www-.*.yoyodyne.com".
RE: regex support RFC
> From: Oliver Schulze L. [mailto:[EMAIL PROTECTED]
> My personal idea on this is to: enable regex in Unix and
> disable it on Windows.
>
> We all use Unix/Linux and regex is really usefull. I think not having

We all use Unix/Linux? You would be surprised how many wget users on
windows are out there.

Besides that, Those Who Know The Code better than me please consider how
bad portability issues in using native regexp engines could be. Are the
interfaces and capabilities all the same, or are there consistent
differences between various flavors (gnu, several BSD, hpux, aix, sunos,
solaris, older flavours...)? If so, that would be a point favouring an
external library (hopefully supported on as many flavours as possible).

Heiko

-- 
-- PREVINET S.p.A. www.previnet.it
-- Heiko Herold [EMAIL PROTECTED] [EMAIL PROTECTED]
-- +39-041-5907073 / +39-041-5917073 ph
-- +39-041-5907472 / +39-041-5917472 fax
Re: regex support RFC
Curtis Hatter wrote:
> On Thursday 30 March 2006 13:42, Tony Lewis wrote:
> > Perhaps --filter=path,i:/path/to/krs would work.
>
> That would look to be the most elegant method. I do hope that the (?i:)
> and (?-i:) constructs are supported since I may not want the entire
> path/file to be case (in)?sensitive =), but that will depend on the
> regex engine chosen.

while i like the idea of supporting modifiers like "quick" (short
circuit) and maybe "i" (case insensitive comparison), i think that (?i:)
and (?-i:) constructs would be overkill and rather hard to implement.

-- 
Mauro Tortonesi                          http://www.tortonesi.com
Re: regex support RFC
Oliver Schulze L. wrote:
> Hrvoje Niksic wrote:
> > The regexp API's found on today's Unix systems might be usable, but
> > unfortunately those are not available on Windows.
>
> My personal idea on this is to: enable regex in Unix and disable it on
> Windows. We all use Unix/Linux and regex is really useful. I think not
> having regex on Windows will not do any more harm that it is doing now
> (not having it at all)

for consistency and to avoid maintenance problems, i would like wget to
have the same behavior on windows and unix. please, notice that if we
implemented regex support only on unix, windows binaries of wget built
with cygwin would have regex support but native binaries wouldn't. that
would be very confusing for windows users, IMHO.

> I hope wget can get connection cache,

this is planned for wget 1.12 (which might become 2.0). i already have
some code implementing the connection cache data structure.

> URL regex

this is planned for wget 1.11. i've already started working on it.

> and advanced mirror functions (sync 2 folders) in the near future.

this is very interesting.

-- 
Mauro Tortonesi                          http://www.tortonesi.com
Re: regex support RFC
Hrvoje Niksic wrote:
> Mauro Tortonesi <[EMAIL PROTECTED]> writes:
> > Scott Scriven wrote:
> > > * Mauro Tortonesi <[EMAIL PROTECTED]> wrote:
> > > > wget -r --filter=-domain:www-*.yoyodyne.com
> > >
> > > This appears to match "www.yoyodyne.com", "www--.yoyodyne.com",
> > > "www---.yoyodyne.com", and so on, if interpreted as a regex.
> >
> > not really. it would not match www.yoyodyne.com.
>
> Why not?

i may be wrong, but if - is not a special character, the previous
expression should match only domains starting with www- and ending in
[randomchar]yoyodyne[randomchar]com.

-- 
Mauro Tortonesi                          http://www.tortonesi.com
Re: regex support RFC
Mauro Tortonesi <[EMAIL PROTECTED]> writes:
> Scott Scriven wrote:
> > * Mauro Tortonesi <[EMAIL PROTECTED]> wrote:
> > > wget -r --filter=-domain:www-*.yoyodyne.com
> >
> > This appears to match "www.yoyodyne.com", "www--.yoyodyne.com",
> > "www---.yoyodyne.com", and so on, if interpreted as a regex.
>
> not really. it would not match www.yoyodyne.com.

Why not?

> > Perhaps you want glob patterns instead? I know I wouldn't mind
> > having glob patterns in addition to regexes... glob is much
> > easier when you're not doing complex matches.
>
> no. i was talking about regexps. they are more expressive and
> powerful than simple globs. i don't see what's the point in
> supporting both.

I agree with this.
Re: regex support RFC
Scott Scriven wrote:
> * Mauro Tortonesi <[EMAIL PROTECTED]> wrote:
> > wget -r --filter=-domain:www-*.yoyodyne.com
>
> This appears to match "www.yoyodyne.com", "www--.yoyodyne.com",
> "www---.yoyodyne.com", and so on, if interpreted as a regex.

not really. it would not match www.yoyodyne.com.

> It would most likely also match "www---zyoyodyneXcom".

yes.

> Perhaps you want glob patterns instead? I know I wouldn't mind having
> glob patterns in addition to regexes... glob is much easier when you're
> not doing complex matches.

no. i was talking about regexps. they are more expressive and powerful
than simple globs. i don't see what's the point in supporting both.

> If I had to choose just one though, I'd prefer to use PCRE,
> Perl-Compatible Regular Expressions. They offer a richer, more concise
> syntax than traditional regexes, such as \d instead of [:digit:] or
> [0-9].

i agree, but adding a dependency from PCRE to wget is asking for
infinite maintenance nightmares. and i don't know if we can simply
bundle code from PCRE in wget, as it has a BSD license.

> > --filter=[+|-][file|path|domain]:REGEXP
> >
> > is it consistent? is it flawed? is there a more convenient one?
>
> It seems like a good idea, but wouldn't actually provide the
> regex-filtering features I'm hoping for unless there was a "raw" type
> in addition to "file", "domain", etc. I'll give details below.
> Basically, I need to match based on things like the inline CSS data,
> the visible link text, etc.

do you mean you would like to have a regex class working on the content
of downloaded files as well?

> Below is the original message I sent to the wget list a few months ago,
> about this same topic:
>
> =
> I'd find it useful to guide wget by using regular expressions to
> control which links get followed. For example, to avoid following links
> based on embedded css styles or link text. I've needed this several
> times, but the most recent was when I wanted to avoid following any
> "add to cart" or "buy" links on a site which uses GET parameters
> instead of directories to select content.
>
> Given a link like this...
>
> <a href="http://www.foo.com/forums/gallery2.php?g2_controller=cart.AddToCart&g2_itemId=11436&g2_return=http%3A%2F%2Fwww.foo.com%2Fforums%2Fgallery2.php%3Fg2_view%3Dcore.ShowItem%26g2_itemId%3D11436%26g2_page%3D4%26g2_GALLERYSID%3D1d78fb5be7613cc31d33f7dfe7fbac7b&g2_GALLERYSID=1d78fb5be7613cc31d33f7dfe7fbac7b&g2_returnName=album"
> class="gbAdminLink gbAdminLink gbLink-cart_AddToCart">add to cart</a>
>
> ... a useful parameter could be --ignore-regex='AddToCart|add to cart'
> so the class or link text (really, anything inside the <a> tag) could
> be used to decide whether the link should be followed.
>
> Or... if there's already a way to do this, let me know. I didn't see
> anything in the docs, but I may have missed something. :)
> =
>
> I think what I want could be implemented via the --filter option, with
> a few small modifications to what was proposed. I'm not sure exactly
> what syntax to use, but it should be able to specify whether to
> include/exclude the link, which PCRE flags to use, how much of the raw
> HTML tag to use as input, and what pattern to use for matching.
>
> Here's an idea:
>
>   --filter=[allow][flags,][scope][:]pattern
>
> Example: '--filter=-i,raw:add ?to ?cart'
> (the quotes are there only to make the shell treat it as one parameter)
>
> The details are:
>
> "allow" is "+" for "include" or "-" for "exclude". It defaults to "+"
> if omitted.
>
> "flags," is a set of letters to control regex options, followed by a
> comma (to separate it from scope). For example, "i" specifies a
> case-insensitive search. These would be the same flags that perl
> appends to the end of search patterns. So, instead of "/foo/i", it
> would be "--filter=+i,:foo"
>
> "scope" controls how much of the <a> or similar tag gets used as input
> to the regex. Values include:
>
>   raw:    use the entire tag and all contents (default)
>           <a href="http://www.example.com/path/to/foo.ext">bar</a>
>   domain: use only the domain name
>           www.example.com
>   file:   use only the file name
>           foo.ext
>   path:   use the directory, but not the file name
>           /path/to
>   others... can be added as desired
>
> ":" is required if "allow" or "flags" or "scope" is given
>
> So, for example, to exclude the "add to cart" links in my previous
> post, this could be used:
>
>   --filter=-raw:'AddToCart|add to cart'
> or
>   --filter=-raw:AddToCart\|add\ to\ cart
> or
>   --filter=-:'AddToCart|add to cart'
> or
>   --filter=-i,raw:'add ?to ?cart'
>
> Alternately, the --filter option could be split into two options: one
> for including content, and one for excluding. This would be more
> consistent with wget's existing parameters, and would slightly simplify
> the syntax.
>
> I hope I haven't been too full of hot air. This is a feature I've
> wanted in wget for a long time, and I'm a bit excited that it might
> happen soon. :)

i don't like your "raw" proposal as it is HTML-specific. i would like
instead to develop a mechanism which could work for all supported
protocols.
Re: regex support RFC
* [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
> wget -e robots=off -r -N -k -E -p -H http://www.gnu.org/software/wget/
>
> soon leads to non wget related links being downloaded, eg.
> http://www.gnu.org/graphics/agnuhead.html

In that particular case, I think --no-parent would solve the problem.
Maybe I misunderstood, though.

It seems awfully risky to use -r and -H without having something to
strictly limit the links followed. So, I suppose the content filter
would be an effective way to make cross-host downloading safer.

I think I'd prefer to have a different option, for that sort of thing --
filter by using external programs. If the program returns a specific
code, follow the link or recurse into the links contained in the file.
Then you could do far more complex filtering, including things like
interactive pruning.

-- Scott
Re: regex support RFC
Hrvoje Niksic wrote:
> The regexp API's found on today's Unix systems might be usable, but
> unfortunately those are not available on Windows.

My personal idea on this is to: enable regex in Unix and disable it on
Windows. We all use Unix/Linux and regex is really useful. I think not
having regex on Windows will not do any more harm that it is doing now
(not having it at all).

I hope wget can get connection cache, URL regex and advanced mirror
functions (sync 2 folders) in the near future. That's all I still want
from wget and still could not find in other OSS software.

Thanks
Oliver

-- 
Oliver Schulze L. <[EMAIL PROTECTED]>
Re: regex support RFC
On Thursday 30 March 2006 13:42, Tony Lewis wrote: > Perhaps --filter=path,i:/path/to/krs would work. That would look to be the most elegant method. I do hope that the (?i:) and (?-i:) constructs are supported since I may not want the entire path/file to be case (in)?sensitive =), but that will depend on the regex engine chosen. Curtis
RE: regex support RFC
Curtis Hatter wrote: > Also any way to add modifiers to the regexs? Perhaps --filter=path,i:/path/to/krs would work. Tony
Re: regex support RFC
* Jim Wright <[EMAIL PROTECTED]> wrote: > Suppose you want files from some.dom.com://*/foo/*.png. The > part I'm thinking of here is "foo as last directory component, > and png as filename extension." Can the individual rules be > combined to express this? Only one rule is needed for that pattern: some.dom.com/.*/foo/[^/]*.png$ The file/path/domain specifiers would actually get in the way for this type of matching. What you'd really want is either a full URI/URL match, or a raw tag match. Either would work with the above pattern. -- Scott
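Checking that rule quickly in Perl (dots escaped here for precision,
which the sketch above leaves loose):

  my $re = qr{some\.dom\.com/.*/foo/[^/]*\.png$};
  print "1\n" if "http://some.dom.com/a/b/foo/pic.png"   =~ $re;  # matches
  print "2\n" if "http://some.dom.com/a/foo/sub/pic.png" =~ $re;  # no match:
                                      # a slash follows foo/ before .png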
Re: regex support RFC
* Mauro Tortonesi <[EMAIL PROTECTED]> wrote:
> wget -r --filter=-domain:www-*.yoyodyne.com

This appears to match "www.yoyodyne.com", "www--.yoyodyne.com",
"www---.yoyodyne.com", and so on, if interpreted as a regex. It would
most likely also match "www---zyoyodyneXcom".

Perhaps you want glob patterns instead? I know I wouldn't mind having
glob patterns in addition to regexes... glob is much easier when you're
not doing complex matches.

If I had to choose just one though, I'd prefer to use PCRE,
Perl-Compatible Regular Expressions. They offer a richer, more concise
syntax than traditional regexes, such as \d instead of [:digit:] or
[0-9].

> --filter=[+|-][file|path|domain]:REGEXP
>
> is it consistent? is it flawed? is there a more convenient one?

It seems like a good idea, but wouldn't actually provide the
regex-filtering features I'm hoping for unless there was a "raw" type in
addition to "file", "domain", etc. I'll give details below. Basically, I
need to match based on things like the inline CSS data, the visible link
text, etc.

> please notice that supporting multiple comma-separated regexp in a
> single --filter option:
>
> --filter=[+|-][file|path|domain]:REGEXP1,REGEXP2,...

Commas for multiple regexes are unnecessary. Regexes already have an
"or" operator built in. If you want to match "fee" or "fie" or "foe" or
"fum", the pattern is fee|fie|foe|fum.

> we also have to reach consensus on the filtering algorithm. for
> instance, should we simply require that a url passes all the
> filtering rules to allow its download (just like the current -A/R
> behaviour), or should we instead adopt a short circuit algorithm that
> applies all rules in the same order in which they were given in the
> command line and immediately allows the download of an url if it
> passes the first "allow" match?

Regexes implicitly have "or" functionality built in, via the pipe
operator. They also have "and" built in simply by extending the pattern.
To require both "foo" and "bar" in a match, you could do something like
"foo.*bar|bar.*foo". So, it's not strictly necessary to support more
than one regex unless you specify both an include pattern and an exclude
pattern.

However, if multiple patterns are supported, I think it would be more
helpful to implement them as "and" rather than "or". This is just
because "and" doubles the length of the filter, so it may be more
convenient to say "--filter=foo --filter=bar" than
"--filter='foo.*bar|bar.*foo'".

Below is the original message I sent to the wget list a few months ago,
about this same topic:

=
I'd find it useful to guide wget by using regular expressions to control
which links get followed. For example, to avoid following links based on
embedded css styles or link text. I've needed this several times, but
the most recent was when I wanted to avoid following any "add to cart"
or "buy" links on a site which uses GET parameters instead of
directories to select content.

Given a link like this...

<a href="http://www.foo.com/forums/gallery2.php?g2_controller=cart.AddToCart&g2_itemId=11436&g2_return=http%3A%2F%2Fwww.foo.com%2Fforums%2Fgallery2.php%3Fg2_view%3Dcore.ShowItem%26g2_itemId%3D11436%26g2_page%3D4%26g2_GALLERYSID%3D1d78fb5be7613cc31d33f7dfe7fbac7b&g2_GALLERYSID=1d78fb5be7613cc31d33f7dfe7fbac7b&g2_returnName=album"
class="gbAdminLink gbAdminLink gbLink-cart_AddToCart">add to cart</a>

... a useful parameter could be --ignore-regex='AddToCart|add to cart'
so the class or link text (really, anything inside the <a> tag) could be
used to decide whether the link should be followed.

Or... if there's already a way to do this, let me know. I didn't see
anything in the docs, but I may have missed something. :)
=

I think what I want could be implemented via the --filter option, with a
few small modifications to what was proposed. I'm not sure exactly what
syntax to use, but it should be able to specify whether to
include/exclude the link, which PCRE flags to use, how much of the raw
HTML tag to use as input, and what pattern to use for matching.

Here's an idea:

  --filter=[allow][flags,][scope][:]pattern

Example: '--filter=-i,raw:add ?to ?cart'
(the quotes are there only to make the shell treat it as one parameter)

The details are:

"allow" is "+" for "include" or "-" for "exclude". It defaults to "+" if
omitted.

"flags," is a set of letters to control regex options, followed by a
comma (to separate it from scope). For example, "i" specifies a
case-insensitive search. These would be the same flags that perl appends
to the end of search patterns. So, instead of "/foo/i", it would be
"--filter=+i,:foo"

"scope" controls how much of the <a> or similar tag gets used as input
to the regex. Values include:

  raw:    use the entire tag and all contents (default)
          <a href="http://www.example.com/path/to/foo.ext">bar</a>
  domain: use only the domain name
          www.example.com
  file:   use only the file name
          foo.ext
  path:   use the directory, but not the file name
          /path/to
Re: regex support RFC
On Thursday 30 March 2006 11:49, you wrote:
> How many keywords do we need to provide maximum flexibility on the
> components of the URI? (I'm thinking we need five.)
>
> Consider http://www.example.com/path/to/script.cgi?foo=bar
>
> --filter=uri:regex could match against any part of the URI
> --filter=domain:regex could match against www.example.com
> --filter=path:regex could match against /path/to/script.cgi
> --filter=file:regex could match against script.cgi
> --filter=query:regex could match against foo=bar
>
> I think there are good arguments for and against matching against the
> file name in "path:"
>
> Tony

The query keyword is a great idea. So many of the sites I download from
use that, and it would greatly help in limiting the material that is
downloaded.

I was also wondering this: does "path:" need the beginning and end
slashes, or are those assumed? They could be assumed, but if you combine
"file:" with the path I'm not sure you can make that assumption anymore.
This comes into play when wanting to match at the start, or at the end,
of a path:

  --filter=path:^path/to/files   or   --filter=path:^/path/to/files
  --filter=path:path/to/files$   or   --filter=path:path/to/files/$

Also, is there any way to add modifiers to the regexes? The only
modifier I can think of off the top of my head that would see much use
is /i. I download material from a site where for some reason they use
KRS and krs interchangeably in the path name. So something akin to
"--filter=path:^path/to/(?i:krs)" would be helpful. Or some other way to
include modifiers?

Curtis
RE: regex support RFC
How many keywords do we need to provide maximum flexibility on the
components of the URI? (I'm thinking we need five.)

Consider http://www.example.com/path/to/script.cgi?foo=bar

  --filter=uri:regex     could match against any part of the URI
  --filter=domain:regex  could match against www.example.com
  --filter=path:regex    could match against /path/to/script.cgi
  --filter=file:regex    could match against script.cgi
  --filter=query:regex   could match against foo=bar

I think there are good arguments for and against matching against the
file name in "path:"

Tony
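For instance, against the URI above, a rule under this (still
hypothetical) syntax such as

  --filter=-query:^foo=

would reject the link because its query string starts with "foo=", while
leaving links without that parameter alone.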
Re: regex support RFC
On Wednesday 29 March 2006 12:05, you wrote:
> we also have to reach consensus on the filtering algorithm. for
> instance, should we simply require that a url passes all the filtering
> rules to allow its download (just like the current -A/R behaviour), or
> should we instead adopt a short circuit algorithm that applies all rules
> in the same order in which they were given in the command line and
> immediately allows the download of an url if it passes the first "allow"
> match? should we also support apache-like deny-from-all and
> allow-from-all policies? and what would be the best syntax to trigger
> the usage of these policies?

I would recommend parsing the filters in the order given; that puts the
onus on the user to optimize the filters and not on you. Another way
could possibly be all filters by domain, then path, and finally file.

Regardless of how you ultimately decide to order the filters, would it
be possible to allow for users to specify a short circuit? I'm thinking
of something similar to PF's quick keyword
(http://www.openbsd.org/faq/pf/filter.html#quick). Example usage of this
would be something like:

Need to mirror a site that uses several domains:

  --filter=+domain:example.(net|org|com)

Within that domain several paths. One of those paths, which is four
levels deep, I know I want everything from, regardless of its file
name/type/etc.:

  --filter=+path,quick:([^/]+/){3}thefiles

The "quick" keyword is used to skip all other filters, because I've told
wget that I'm sure I want everything in that path if it matches. Wget
would first evaluate the domain, if it passes evaluate the path, and if
that passes then skip all other filters. Should it fail, wget continues
to evaluate the rest of the filters.

Another example: I know I want nothing from any site other than
example.com:

  --filter=-domain,quick:^(?!example.com)

That should ignore any domain that doesn't begin with example.com and
skip all other rules because of the "quick" keyword.

This would make processing more efficient, since other filters don't
have to be evaluated.

Curtis
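The evaluation order Curtis describes -- rules tried in command-line
order, with "quick" forcing an immediate decision -- boils down to
something like this (a Perl sketch; the rule representation is invented
for illustration):

  # each rule: [ regexp, allow?, quick? ], tried in command-line order
  sub url_allowed {
      my ($url, @rules) = @_;
      for my $rule (@rules) {
          my ($re, $allow, $quick) = @$rule;
          next unless $url =~ $re;
          return $allow if $quick;   # "quick": decide now, skip the rest
          return 0 unless $allow;    # a matching deny rejects the url
      }
      return 1;                      # nothing denied it: allow
  }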
Re: regex support RFC
On Thu, 30 Mar 2006, Mauro Tortonesi wrote: > > > I do like the [file|path|domain]: approach. very nice and flexible. > > (and would be a huge help to one specific need I have!) I suggest also > > including an "any" option as a shortcut for putting the same pattern in > > all three options. > > do you think the "any" option would be really useful? if so, could you please > give us an example? Depends on how individual [file|path|domain]: entries are combined. AND, OR? Suppose you want files from some.dom.com://*/foo/*.png. The part I'm thinking of here is "foo as last directory component, and png as filename extension." Can the individual rules be combined to express this? I guess the real question is, how are rules combined. Jim
RE: regex support RFC
> From: Hrvoje Niksic [mailto:[EMAIL PROTECTED]
> > I agree. Just how often will there be problems in a single wget run
> > due to both some.domain.com and somedomain.com present (famous last
> > words...)
>
> Actually it would have to be somedomain.com -- a "." will not match
> the null string.

yes, sorry.

> My point was that people who care about that potential problem will
> carefully quote their dots, while the rest of us will use the more
> convenient notation.

I am of the same opinion. I'm just wondering how often the "correct"
notations will be necessary. I don't think in my personal experience
something like that would ever happen, but what with the plethora of
similar domains existing (and even worse, the phishing domains...)

Heiko

-- 
-- PREVINET S.p.A. www.previnet.it
-- Heiko Herold [EMAIL PROTECTED] [EMAIL PROTECTED]
-- +39-041-5907073 / +39-041-5917073 ph
-- +39-041-5907472 / +39-041-5917472 fax
Re: regex support RFC
Herold Heiko <[EMAIL PROTECTED]> writes: >> From: Hrvoje Niksic [mailto:[EMAIL PROTECTED] >> I don't think such a thing is necessary in practice, though; remember >> that even if you don't escape the dot, it still matches the (intended) >> dot, along with other characters. So for quick&dirty usage not >> escaping dots will "just work", and those who want to be precise can >> escape them. > > I agree. Just how often will there be problems in a single wget run due to > both some.domain.com and somedomain.com present (famous last > words...) Actually it would have to be somedomain.com -- a "." will not match the null string. My point was that people who care about that potential problem will carefully quote their dots, while the rest of us will use the more convenient notation.
RE: regex support RFC
> From: Hrvoje Niksic [mailto:[EMAIL PROTECTED] > I don't think such a thing is necessary in practice, though; remember > that even if you don't escape the dot, it still matches the (intended) > dot, along with other characters. So for quick&dirty usage not > escaping dots will "just work", and those who want to be precise can > escape them. I agree. Just how often will there be problems in a single wget run due to both some.domain.com and somedomain.com present (famous last words...) Heiko -- -- PREVINET S.p.A. www.previnet.it -- Heiko Herold [EMAIL PROTECTED] [EMAIL PROTECTED] -- +39-041-5907073 / +39-041-5917073 ph -- +39-041-5907472 / +39-041-5917472 fax
Re: regex support RFC
Herold Heiko <[EMAIL PROTECTED]> writes: > Get the best of both, use a syntax permitting a "first match-exits" > ACL, single ACE permits several statements ANDed together. Cooking > up a simple syntax for users without much regexp experience won't be > easy. I assume ACL stands for "access control list", but what is ACE? > One way (probably not the most beautiful syntax) could be a running > number, The numbers are just too ugly, sorry. Also, having *two* instances of +/- (one before the "=" and one after the "=") is just too confusing; it took me a minute or two to figure out. > I realize much of this syntax can be thrown out of the window simply > considering we can probably reach the same effect with uri filters and more > complicated regexp (perl5 syntax): > --filter=+uri:.+\.dom\.com/*.download > --filter=-domain:sweets\.dom\.com > --filter="+uri:peanuts\.dom\.com/.*(?!brown)" > --filter=+path:peanuts That seems much more acceptable IMHO.
Re: regex support RFC
Herold Heiko <[EMAIL PROTECTED]> writes: > BTW any comments about the dots ? Requiring escaped dots in domains would > become old really fast, reversing behaviour (\. = any char) would be against > the principle of least surprise, since any other regexp syntax does use the > opposite. Modifying the dot to only match a dot might be useful for "domain" patterns, but I suspect it's not easy to implement. I don't think such a thing is necessary in practice, though; remember that even if you don't escape the dot, it still matches the (intended) dot, along with other characters. So for quick&dirty usage not escaping dots will "just work", and those who want to be precise can escape them. > Either way pure windows users will be confused (*.html instead of > .*\.html), Increased expressive power will hopefully outweigh the confusion. After all, people who use Wget on Windows are hardly typical Windows users. :-) > but personally I don't think permitting yet another alternate syntax > (using globs) is justified, and a syntax using exclusively globs > would be too limited. My thoughts exactly.
RE: regex support RFC
[Imagination running freely; I do not have a lot of experience designing
syntax, but I suffer a lot in a helpdeskish way trying to explain syntax
to users. Hopefully this can be somehow useful.]

> we also have to reach consensus on the filtering algorithm. for
> instance, should we simply require that a url passes all the filtering
> rules to allow its download (just like the current -A/R behaviour), or
> should we instead adopt a short circuit algorithm that applies all
> rules in the same order in which they were given in the command line
> and immediately allows the download of an url if it passes the first
> "allow" match? should we also support apache-like deny-from-all and
> allow-from-all policies? and what would be the best syntax to trigger
> the usage of these policies?

Get the best of both: use a syntax permitting a "first match exits" ACL,
where a single ACE permits several statements ANDed together. Cooking up
a simple syntax for users without much regexp experience won't be easy.

One way (probably not the most beautiful syntax) could be a running
number: AND together repeated filters with the same number, but use
FIRST MATCH between numbers. Say we want to: download every path
containing "download" on every *.dom.com (including sweets.dom.com);
OTHERWISE avoid anything (else) on sweets.dom.com; OTHERWISE from
peanuts.dom.com get everything except brown stuff (currants and so on);
OTHERWISE get peanuts from everywhere else:

  --filter1+=+domain:.+\.dom\.com --filter1=+path:download &&
  --filter2-=+domain:sweets\.dom\.com &&
  --filter3+=+peanuts\.dom\.com --filter3=-file:brown &&
  --filter4+=+path:peanuts
  (&& omitted later on)

The first filterX (for every X) carries a +/- before the = (permit/deny
ACE); every filterX carries a + or - after the = (what we are matching).

Well, I wrote the example and I hate it already; hopefully some better
syntax comes up which doesn't require nested quotes. Another option is
to require an additional switch permit/deny for every ACE:

  --filter1=permit --filter1=+domain:.+\.dom\.com --filter1=+path:download
  --filter2=deny --filter2=+domain:sweets\.dom\.com
  --filter3=permit --filter3=+peanuts\.dom\.com --filter3=-file:brown
  --filter4=permit --filter4=+path:peanuts

With permit and + as defaults that would make:

  --filter1=domain:.+\.dom\.com --filter1=path:download
  --filter2=deny --filter2=domain:sweets\.dom\.com
  --filter3=peanuts\.dom\.com --filter3=-file:brown
  --filter4=path:peanuts

On the other hand, without the default=permit we could lose the numbers
(use position):

  --filter=permit --filter=+domain:.+\.dom\.com --filter=+path:download
  --filter=deny --filter=+domain:sweets\.dom\.com
  --filter=permit --filter=+peanuts\.dom\.com --filter=-file:brown
  --filter=permit --filter=+path:peanuts

e.g. start with permit or deny (or default permit for the first ACE
only); following statements are ANDed together as a single ACE until the
next permit/deny.

Considering command line restrictions and so on, for complicated
expressions there should also be a --filter-file=filename, same syntax
except the --filter?

I realize much of this syntax can be thrown out of the window simply
considering we can probably reach the same effect with uri filters and
more complicated regexps (perl5 syntax):

  --filter=+uri:.+\.dom\.com/*.download
  --filter=-domain:sweets\.dom\.com
  --filter="+uri:peanuts\.dom\.com/.*(?!brown)"
  --filter=+path:peanuts

Simpler and shorter invocation syntax, but more complicated regexp
requirements; not a simple thing for the casual user. After all, wget
doesn't try to appeal to programmers only, so many examples in the
manual will be necessary.

BTW any comments about the dots? Requiring escaped dots in domains would
become old really fast; reversing behaviour (\. = any char) would be
against the principle of least surprise, since every other regexp syntax
does the opposite. Either way pure windows users will be confused
(*.html instead of .*\.html), but personally I don't think permitting
yet another alternate syntax (using globs) is justified, and a syntax
using exclusively globs would be too limited.

Heiko

-- 
-- PREVINET S.p.A. www.previnet.it
-- Heiko Herold [EMAIL PROTECTED] [EMAIL PROTECTED]
-- +39-041-5907073 / +39-041-5917073 ph
-- +39-041-5907472 / +39-041-5917472 fax
Re: regex support RFC
Jim Wright wrote:
> what definition of regexp would you be following?

that's another degree of freedom. hrvoje and i have chosen to integrate
in wget the GNU regex implementation, which allows the exploitation of
one of these different syntaxes:

  RE_SYNTAX_EMACS
  RE_SYNTAX_AWK
  RE_SYNTAX_GNU_AWK
  RE_SYNTAX_POSIX_AWK
  RE_SYNTAX_GREP
  RE_SYNTAX_EGREP
  RE_SYNTAX_POSIX_EGREP
  RE_SYNTAX_POSIX_BASIC
  RE_SYNTAX_POSIX_MINIMAL_BASIC
  RE_SYNTAX_POSIX_EXTENDED
  RE_SYNTAX_POSIX_MINIMAL_EXTENDED

(see http://cvs.savannah.gnu.org/viewcvs/emacs/emacs/src/regex.h?view=markup)

among these, i would probably go for a POSIX_EXTENDED syntax.

> I'm not quite understanding the comment about the comma and needing
> escaping for literal commas. this is true for any character in the
> regexp language, so why the special concern for comma?

hrvoje already answered this question.

> I do like the [file|path|domain]: approach. very nice and flexible.
> (and would be a huge help to one specific need I have!) I suggest also
> including an "any" option as a shortcut for putting the same pattern
> in all three options.

do you think the "any" option would be really useful? if so, could you
please give us an example?

-- 
Mauro Tortonesi                          http://www.tortonesi.com
Re: regex support RFC
Jim Wright <[EMAIL PROTECTED]> writes:
> what definition of regexp would you be following? or would this be
> making up something new?

It wouldn't be new, Mauro is definitely referring to regexps as normally
understood. The regexp API's found on today's Unix systems might be
usable, but unfortunately those are not available on Windows. They also
lack the support for the very useful non-greedy matching quantifier (the
"?" modifier to the "*" operator) introduced by Perl 5 and supported by
most of today's major regexp implementations: Python, Java, Tcl, etc.

One idea was to use PCRE, bundling it with Wget for the sake of Windows
and systems without PCRE. Another (http://tinyurl.com/elp7h) was to use
and bundle Emacs's regex.c, the version of GNU regex shipped with GNU
Emacs. It is small (one source file) and offers Unix-compatible basic
and extended regexps, but also supports the non-greedy quantifier and
non-capturing groups. See the message and the related discussion at
http://tinyurl.com/mdwhx for more about this topic.

> I'm not quite understanding the comment about the comma and needing
> escaping for literal commas.

Supporting PATTERN1,PATTERN2,... would require having a way to quote the
comma character. But there is little reason for a specific comma syntax
since one can always use (PATTERN1|PATTERN2|...). Being unable to have a
comma in the pattern is a shortcoming in the current -R/-A options.

> I do like the [file|path|domain]: approach. very nice and flexible.

Thanks.
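To illustrate why the non-greedy quantifier matters for this kind of
filtering, in Perl:

  my $s = '<a href="one.html">x</a><a href="two.html">y</a>';
  my ($greedy)     = $s =~ /href="(.*)"/;   # one.html">x</a><a href="two.html
  my ($non_greedy) = $s =~ /href="(.*?)"/;  # one.html

POSIX regexps simply have no way to write the second pattern directly.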
Re: regex support RFC
Mauro Tortonesi <[EMAIL PROTECTED]> writes: > for instance, the syntax for --filter presented above is basically the > following: > > --filter=[+|-][file|path|domain]:REGEXP I think there should also be "url" for filtering on the entire URL. People have been asking for that kind of thing a lot over the years.
Re: regex support RFC
> for instance, the syntax for --filter presented above is basically the
> following:
>
> --filter=[+|-][file|path|domain]:REGEXP

I think a file 'contents' regexp search facility would be a useful
addition here, eg.

  --filter=[+|-][file|path|domain|contents]:REGEXP

The idea is that if the file just downloaded has a regexp match for
expression REGEXP (ie. as in 'egrep REGEXP file.html') then that file is
kept and its links processed as normal. If no match is found the file is
just deleted. Such a facility could be used to prevent recursive
downloads wandering way off topic. eg.

  wget -e robots=off -r -N -k -E -p -H http://www.gnu.org/software/wget/

soon leads to non-wget-related links being downloaded, eg.
http://www.gnu.org/graphics/agnuhead.html

My suggestion is that with

  wget -e robots=off -r -N -k -E -p -H --filter=+contents:wget http://www.gnu.org/software/wget/

any page not containing the string 'wget' is deleted and its links not
followed.

Thanks
Tom Crane
-- 
Tom Crane, Dept. Physics, Royal Holloway, University of London,
Egham Hill, Egham, Surrey, TW20 0EX, England.
Email: [EMAIL PROTECTED]
Fax: +44 (0) 1784 472794
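The keep/delete decision Tom describes amounts to a post-download check
along these lines (a Perl sketch of the idea only; keep_page is a
made-up helper, not wget code):

  # keep a downloaded page only if its contents match the regexp
  sub keep_page {
      my ($file, $re) = @_;
      open my $fh, '<', $file or return 0;
      local $/;                      # slurp the whole file
      my $body = <$fh>;
      close $fh;
      return $body =~ /$re/;         # e.g. qr/wget/ for the example above
  }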
Re: regex support RFC
what definition of regexp would you be following? or would this be
making up something new?

I'm not quite understanding the comment about the comma and needing
escaping for literal commas. this is true for any character in the
regexp language, so why the special concern for comma?

I do like the [file|path|domain]: approach. very nice and flexible. (and
would be a huge help to one specific need I have!) I suggest also
including an "any" option as a shortcut for putting the same pattern in
all three options.

Jim

On Wed, 29 Mar 2006, Mauro Tortonesi wrote:

> hrvoje and i have been recently talking about adding regex support to
> wget. we were considering to add a new --filter option which, by
> supporting regular expressions, would allow more powerful ways of
> filtering urls to download.
>
> for instance the new option could allow the filtering of domain names,
> file names and url paths. in the following case --filter is used to
> prevent any download from the www-*.yoyodyne.com domain and to
> restrict download only to .gif files:
>
> wget -r --filter=-domain:www-*.yoyodyne.com --filter=+file:\.gif$
> http://yoyodyne.com
>
> (notice that --filter interprets every given rule as a regex).
>
> i personally think the --filter option would be a great new feature
> for wget, and i have already started working on its implementation,
> but we still have a few open questions.
>
> for instance, the syntax for --filter presented above is basically the
> following:
>
> --filter=[+|-][file|path|domain]:REGEXP
>
> is it consistent? is it flawed? is there a more convenient one?
>
> please notice that supporting multiple comma-separated regexps in a
> single --filter option:
>
> --filter=[+|-][file|path|domain]:REGEXP1,REGEXP2,...
>
> would significantly complicate the implementation and usage of
> --filter, as it would require escaping of the "," character. also
> notice that current filtering options like -A/R are somewhat broken,
> as they do not allow the usage of the "," char in filtering rules.
>
> we also have to reach consensus on the filtering algorithm. for
> instance, should we simply require that a url passes all the filtering
> rules to allow its download (just like the current -A/R behaviour), or
> should we instead adopt a short circuit algorithm that applies all
> rules in the same order in which they were given in the command line
> and immediately allows the download of an url if it passes the first
> "allow" match? should we also support apache-like deny-from-all and
> allow-from-all policies? and what would be the best syntax to trigger
> the usage of these policies?
>
> i am looking forward to reading your opinions on this topic.
>
> P.S.: the new --filter option would replace and extend the old -D,
> -I/X and -A/R options, which will be deprecated but still supported.
>
> -- 
> Mauro Tortonesi                          http://www.tortonesi.com