Re: how to filter only certain URL's?
[EMAIL PROTECTED] (Gary Funck) writes:

> Thanks for the clarification. The 'info' page helped clear things up
> (topic: Types of Files):
>
>   `-A ACCLIST'
>   `--accept ACCLIST'
>   `accept = ACCLIST'
>        The argument to the `--accept' option is a list of file suffixes or
>        patterns that Wget will download during recursive retrieval. A
>        suffix is the ending part of a file, and consists of "normal"
>        letters, e.g. `gif' or `.jpg'. A matching pattern contains
>        shell-like wildcards, e.g. `books*' or `zelazny*196[0-9]*'.
>
>        So, specifying `wget -A gif,jpg' will make Wget download only the
>        files ending with `gif' or `jpg', i.e. GIFs and JPEGs. On the other
>        hand, `wget -A "zelazny*196[0-9]*"' will download only files
>        beginning with `zelazny' and containing numbers from 1960 to 1969
>        anywhere within. Look up the manual of your shell for a description
>        of how pattern matching works.
>
> but the man page that I checked first didn't talk about the pattern
> matching capabilities:

Wget hasn't been distributed with an official man page for some time. Linux vendors often include an old one. Starting in 1.7, an official manpage (autogenerated from the .texi) will be included again.

> In my application, I wanted to apply the pattern to the *entire* URL,
> inclusive of intervening directories. I made a small modification to the
> source code to implement full-URL pattern matching:
>
>   *** utils.c.orig   Sun Jun 25 23:11:44 2000
>   --- utils.c        Mon Feb 19 20:00:14 2001
>   ***************
>   *** 546,557 ****
>   --- 546,559 ----
>     int
>     acceptable (const char *s)
>     {
>   + #ifdef MATCH_ONLY_LAST_PART_OF_URL
>       int l = strlen (s);
>
>       while (l && s[l] != '/')
>         --l;
>       if (s[l] == '/')
>         s += (l + 1);
>   + #endif
>       if (opt.accepts)
>         {
>           if (opt.rejects)

I think we'll have to hold off on making a change like this until we determine whether we'll be adding full regexp matching ability or not.

> Another suggestion: if the user supplies either --accept or --reject
> options, it is possible that the resulting mirrored file hierarchy will
> contain empty directories.
> Although a shell script could easily clean them up, as in:
>
>   find . -type d -exec rmdir --ignore-fail-on-non-empty {} \;
>
> it would be nice if wget took that action as a final clean up.

Okay, I'll add that to the TODO.

--
Dan Harkless                  | To help prevent SPAM contamination,
GNU Wget co-maintainer        | please do not mention this email
http://sunsite.dk/wget/       | address in Usenet posts -- thank you.
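One caveat with the quoted command: it visits parents before children, so a directory that only becomes empty once its empty subdirectories are removed survives the pass. A minimal sketch of a depth-first variant (`prune_empty_dirs` is a made-up helper name; `--ignore-fail-on-non-empty` is GNU rmdir, as in the command above):

```shell
# Prune empty directories bottom-up after a filtered mirror.  -depth makes
# find visit children before their parents, so a directory emptied by the
# removal of its own empty subdirectories is caught in the same pass.
prune_empty_dirs() {
  find "$1" -depth -type d -exec rmdir --ignore-fail-on-non-empty {} \;
}
```

Without -depth, a tree like a/b/c (all empty) would lose only c on the first run, needing repeated passes to remove b and then a.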
Re: how to filter only certain URL's?
[EMAIL PROTECTED] (Gary Funck) writes:

> Hello, I have an application where I want to traverse a given site, but
> only retrieve pages with a URL that matches a particular pattern. The
> pattern would include a specific directory, and a file name that has a
> particular form. If wget won't accept a general pattern, I'd like it if
> wget would just return the URL's it finds during its recursive traversal,
> but not return the data. Given the list of URL's, I can pick out the ones
> I'm interested in, and fetch only those.
>
> Here's an example -- assume that I'm interested in fetching all FAQ pages
> that have "linux" in their file name. Using conventional grep patterns, I
> might be interested in URL's of the form '.*/faqs/.*linux.*\.html', for
> example. Is there a way to do something like this in wget, or some other
> program?

Dunno about other programs, but you could try Soenke Peters' wget-1.5.3gold, which has the Perl regular expression library integrated for use with --accept and --reject (see attached mail). At this time, no opinion has been expressed as to whether the mainline Wget should include this change.

--
Dan Harkless                  | To help prevent SPAM contamination,
GNU Wget co-maintainer        | please do not mention this email
http://sunsite.dk/wget/       | address in Usenet posts -- thank you.

Hi,

I included the Perl-compatible regular expressions (PCRE) library from ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre/ to get better filename matching capabilities for the "--accept" and "--reject" options.

For now, it is a compile-time option. You must specify "--with-pcre" as an option for the configure script and have pcre >= 3.0 installed (RedHat and SuSE RPMs are available -- use http://ftpsearch.ntnu.no). I just uploaded pcre-3.1-47 to RedHat; it should be available on every http://contrib.redhat.com mirror in the next few days.

If you include this feature, any Perl-ish regex can be used in the "--accept" and "--reject" options; the old style isn't supported any more.
Examples:

  -A 'HTML,HTM,JPG,JPEG'    is now:    -A 'HTM.?$|JP.?G$'

Or when mirroring Apache servers:

  -R '^\?.=[AD]'

This seems to be quite complicated for those who have never used Perl, but it is powerful as hell. BTW: Matching is done case-independently.

Attached is a patch against my last "gold" version. The most recent package for now is http://www.simprovement.com/linux/files/wget-1.5.3gold2.tar.gz -- or try http://www.simprovement.com/linux/frech.cgi

My "gold" version includes some patches from this list (e.g. preliminary HTTPS support) and my own extensions. Please send me any patches/bugfixes you make!

Bye,
--
Soenke Jan Peters               Rostock, Germany
PGP on request: [EMAIL PROTECTED], Subject: getpgpkey
http://www.simprovement.com     No solicitations!

diff -urN --exclude=configure wget-1.5.3gold/ChangeLog wget-1.5.3gold2/ChangeLog
--- wget-1.5.3gold/ChangeLog    Wed May 24 22:29:18 2000
+++ wget-1.5.3gold2/ChangeLog   Fri Jun 30 00:39:45 2000
@@ -1,3 +1,15 @@
+2000-06-29  Soenke J. Peters  [EMAIL PROTECTED]
+
+	* PCRE library option at compile time
+	  for --accept and --reject options
+
+	* HTTPS support via OpenSSL
+
+	* Alternative print_percentage() function
+
+	* --filter-script option
+
+	* Referrer faking via --referer=fake
+
 2000-05-24  Dan Harkless  [EMAIL PROTECTED]
 
 	* TODO: Timestamps sometimes not copied over on files retrieved by FTP.
diff -urN --exclude=configure wget-1.5.3gold/Makefile.in wget-1.5.3gold2/Makefile.in
--- wget-1.5.3gold/Makefile.in  Mon Jun 26 07:10:44 2000
+++ wget-1.5.3gold2/Makefile.in Fri Jun 30 00:23:19 2000
@@ -47,7 +47,8 @@
 CPPFLAGS = @CPPFLAGS@ -I$(SSL_INCDIR)
 DEFS = @DEFS@ -DSYSTEM_WGETRC=\"$(sysconfdir)/wgetrc\" -DLOCALEDIR=\"$(localedir)\"
 SSL_LIBS = @SSL_LIBS@
-LIBS = @LIBS@ $(SSL_LIBS)
+PCRE_LIBS = @PCRE_LIBS@
+LIBS = @LIBS@ $(SSL_LIBS) $(PCRE_LIBS)
 SSL_LIBDIR = @SSL_LIBDIR@
 LDFLAGS = @LDFLAGS@

diff -urN --exclude=configure wget-1.5.3gold/NEWS wget-1.5.3gold2/NEWS
--- wget-1.5.3gold/NEWS         Mon Jun 26 10:59:15 2000
+++ wget-1.5.3gold2/NEWS        Fri Jun 30 00:38:00 2000
@@ -5,6 +5,11 @@
 Please send GNU Wget bug reports to [EMAIL PROTECTED].
 
+* Changes in Wget 1.5.3gold2
+
+** PCRE library option at compile time for --accept and --reject options
+
+
 * Changes in Wget 1.5.3gold
 
 ** HTTPS support via OpenSSL

diff -urN --exclude=configure wget-1.5.3gold/configure.in wget-1.5.3gold2/configure.in
--- wget-1.5.3gold/configure.in Mon Jun 26 05:28:27 2000
+++ wget-1.5.3gold2/configure.in        Fri Jun 30 00:54:03 2000
@@ -61,6 +61,20 @@
 [  --with-ssl-libdir=DIR   where to find SSLeay library files (optional)],
 ssl_libdir="$withval", ssl_libdir="/usr/lib")
 
+AC_ARG_WITH(pcre,
+[  --with-pcre             compile in PCRE support],
+[if test "$withval" != "no" ; then
+
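For readers new to regex syntax, Soenke's example acceptance pattern can be tried outside of wget. A sketch using `grep -iE` (POSIX ERE agrees with PCRE on these simple patterns, and `-i` mirrors the patch's case-independent matching; the file names are made up):

```shell
# Try the example accept regex 'HTM.?$|JP.?G$' against some file names.
# HTM.?$ matches "htm" plus at most one trailing character (htm, html);
# JP.?G$ matches both jpg and jpeg.  -i makes the match case-insensitive.
for f in index.html page.HTM photo.jpg photo.jpeg notes.txt; do
  if printf '%s\n' "$f" | grep -qiE 'HTM.?$|JP.?G$'; then
    echo "accept: $f"
  else
    echo "reject: $f"
  fi
done
```

Only notes.txt is rejected; each of the other names satisfies one alternative or the other.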
Re: how to filter only certain URL's?
On Feb 19, 1:51pm, Dan Harkless wrote:

> Dunno about other programs, but you could try Soenke Peters'
> wget-1.5.3gold, which has the Perl regular expression library integrated
> for use with --accept and --reject (see attached mail). At this time, no
> opinion has been expressed as to whether the mainline Wget should include
> this change.

Dan, thanks. Looks like that would be just the ticket; however, at the present time, this URL (and the top-level web site) come up empty: http://www.simprovement.com/linux/files/wget-1.5.3gold2.tar.gz

> Date: Fri, 30 Jun 2000 01:43:15 +0200
> From: "Soenke J. Peters" [EMAIL PROTECTED]
> [...]
> If you include this feature, any Perl-ish regex can be used in the
> "--accept" and "--reject" options; the old style isn't supported any more.
>
> Examples:
>
>   -A 'HTML,HTM,JPG,JPEG'    is now:    -A 'HTM.?$|JP.?G$'
>
> Or when mirroring Apache servers:
>
>   -R '^\?.=[AD]'
>
> This seems to be quite complicated for those who have never used Perl,
> but it is powerful as hell. BTW: Matching is done case-independently.
>
> Attached is a patch against my last "gold" version. The most recent
> package for now is
> http://www.simprovement.com/linux/files/wget-1.5.3gold2.tar.gz
> Or try http://www.simprovement.com/linux/frech.cgi
>
> My "gold" version includes some patches from this list (e.g. preliminary
> HTTPS support) and my own extensions.

Would you happen to have another URL/source for this version of wget? (I checked a few of the wget sites, but didn't find it.) Thanks.
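While hunting for 1.5.3gold, the original question can also be approximated in two passes with stock tools: collect a list of URLs (however obtained -- e.g. scraped from the log of a previous recursive run), filter it with grep, and feed the survivors back to wget with -i. A sketch with a made-up URL list (the example.org names are placeholders):

```shell
# urls.txt stands in for a list of URLs discovered during traversal.
cat > urls.txt <<'EOF'
http://www.example.org/faqs/linux-faq.html
http://www.example.org/faqs/bsd-faq.html
http://www.example.org/linux/download.html
EOF

# Gary's example pattern: FAQ pages with "linux" in the file name.
grep -E '.*/faqs/.*linux.*\.html' urls.txt > wanted.txt

cat wanted.txt            # only the first URL survives the filter
# wget -i wanted.txt      # then fetch just those pages
```

This wastes one traversal pass, but it gives full grep-style control over the whole URL, directories included, without patching wget.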
Re: how to filter only certain URL's?
On Mon, 19 Feb 2001, Dan Harkless wrote:

> Dunno about other programs, but you could try Soenke Peters'
> wget-1.5.3gold, which has the Perl regular expression library integrated
> for use with --accept and --reject (see attached mail). At this time, no
> opinion has been expressed as to whether the mainline Wget should include
> this change.

Being a [useful-]feature junkie when it comes to wget, I feel this could be included without making the program bloated in any way. Regular expressions are almost always a good thing, and it would serve wget's purpose just great if it works flawlessly as described (I haven't had time to patch and test myself), since Perl is the grand master when it comes to regular expressions. With a brief introduction to regular expressions in the docs/info, I think more people would get fewer files they don't want when they mirror a page or an FTP site (making less traffic on the network = good), and at the same time you could have _very_ complex mirrors.

Replacing the current -A and -R options might not be such a good idea, though, since older scripts depending on wget with -A/-R, and people who use them regularly now, would be confused if they just stopped working. Backward compatibility is always good, but then again it's a pain when it restrains progress -- and I personally have no clue how this could be implemented as it is right now without letting go of the old -A/-R style, short of creating brand-new options. Then again, having options for _both_ Perl regular expressions _and_ the old style sounds like an even worse idea. (I'm so happy I'm just a user and not a developer! :-) )

Best regards
Henrik van Ginhoven, Sweden
9799-5 Everyone a story but they end the same
Re: how to filter only certain URL's?
On Feb 19, 3:09pm, Gary Funck wrote:

> Dan, thanks. Looks like that would be just the ticket; however, at the
> present time, this URL (and the top-level web site) come up empty:
> http://www.simprovement.com/linux/files/wget-1.5.3gold2.tar.gz
>
> Would you happen to have another URL/source for this version of wget
> (I checked a few of the wget sites, but didn't find it)? Thanks.

Using Google, I found this source RPM. A little out of the way for US users, but here it is nevertheless:

ftp://ftp.lip6.fr/pub/linux/distributions/redhat-contrib/libc6/SRPMS/wget-1.5.3gold-1.src.rpm
Re: how to filter only certain URL's?
"[EMAIL PROTECTED]" [EMAIL PROTECTED] writes:

> Being a [useful-]feature junkie when it comes to wget, I feel this could
> be included without making the program bloated in any way. [...]
>
> Replacing the current -A and -R options might not be such a good idea,
> though, since older scripts depending on wget with -A/-R, and people who
> use them regularly now, would be confused if they just stopped working.
> [...] Then again, having options for _both_ Perl regular expressions
> _and_ the old style sounds like an even worse idea.

Yes, the fact that you would now have to understand regexps to use -A and -R is one reason that patch didn't get immediately thrown into the main Wget. One way to go might be to have a --regexp option which turns on regexp support in all options that have it (probably starting with just -A and -R). Those that wanted to could turn the option on all the time in their .wgetrc. Also, we'd have to make sure there are no problems bundling PCRE (or another regexp library, if necessary) with Wget.

--
Dan Harkless                  | To help prevent SPAM contamination,
GNU Wget co-maintainer        | please do not mention this email
http://sunsite.dk/wget/       | address in Usenet posts -- thank you.
Re: how to filter only certain URL's?
[EMAIL PROTECTED] (Gary Funck) writes:

> Dan, sorry to trouble you

Please post to [EMAIL PROTECTED] rather than putting the onus on one person to answer you (and depriving everyone else of the information).

> - but that RPM URL that I mentioned appeared to be a version that has
> pattern matching in it, but now it appears that this version has some
> sort of shell-like globbing, and doesn't have the regex stuff. I actually
> would prefer the regex version for what I'm trying to do, and there are
> no docs on how the globbing works, how much of a pathname I can use it
> on, etc. (i.e., does it only match the part of the URL after the
> rightmost slash?) If you can point me at a copy of the regex version,
> that'd be great. thanks, - Gary

The sum of my knowledge on 1.5.3gold is what I read in the author's post. You could try emailing him. Also, perhaps the patch in the email I forwarded successfully applies to the RPM'd version and converts it to regexp.

--
Dan Harkless                  | To help prevent SPAM contamination,
GNU Wget co-maintainer        | please do not mention this email
http://sunsite.dk/wget/       | address in Usenet posts -- thank you.
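On Gary's parenthetical question: in stock wget the acceptable() code shown earlier in the thread strips everything up to the rightmost slash before testing, so -A/-R patterns see only the file name, never intervening directories. A sketch emulating that behavior with shell globbing (matches_accept is a made-up helper, not part of wget):

```shell
# Emulate stock wget's accept test: only the part of the URL after the
# rightmost slash is compared against the shell-style pattern.
matches_accept() {
  url=$1 pattern=$2
  base=${url##*/}            # what acceptable() reduces the URL to
  case $base in
    $pattern) return 0 ;;    # unquoted: treated as a glob, like -A
    *)        return 1 ;;
  esac
}

matches_accept 'http://host/faqs/linux-faq.html' '*linux*' \
  && echo 'accepted: "linux" is in the file name'
matches_accept 'http://host/linux/faq.html' '*linux*' \
  || echo 'rejected: "linux" appears only in a directory component'
```

This is exactly why Gary's full-URL patch from earlier in the thread moves the basename-stripping code behind an #ifdef.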
Re: how to filter only certain URL's?
On Feb 19, 5:17pm, Dan Harkless wrote:

> Please post to [EMAIL PROTECTED] rather than putting the onus on one
> person to answer you (and depriving everyone else of the information).

Okay, no prob.

> > - but that RPM URL that I mentioned appeared to be a version that has
> > pattern matching in it, but now it appears that this version has some
> > sort of shell-like globbing, and doesn't have the regex stuff. I
> > actually would prefer the regex version for what I'm trying to do, and
> > there are no docs on how the globbing works, how much of a pathname I
> > can use it on, etc. (i.e., does it only match the part of the URL after
> > the rightmost slash?)

I experimented with the above-mentioned "globbing", but couldn't figure out how it works (though admittedly I didn't try firing up the debugger to see what's going on). One thing that the matching did appear to be doing, however, is first *downloading the entire page* before making the decision as to whether to keep it or not. This is decidedly not the preferred implementation -- it wastes bandwidth.

> > If you can point me at a copy of the regex version, that'd be great.
> > thanks, - Gary
>
> The sum of my knowledge on 1.5.3gold is what I read in the author's post.
> You could try emailing him. Also, perhaps the patch in the email I
> forwarded successfully applies to the RPM'd version and converts it to
> regexp.

Thanks -- I didn't notice the patchset attachment when I first read your e-mail. I'll give it a shot.
Re: how to filter only certain URL's?
On Feb 19, 5:55pm, Gary Funck wrote:

> One thing that the matching did appear to be doing, however, is first
> *downloading the entire page* before making the decision as to whether to
> keep it or not. This is decidedly not the preferred implementation -- it
> wastes bandwidth.

Well, wget does have to look into each page in order to perform its recursive traversal, so it downloads the page first. However, there are cases when the -l (max. number of levels) option is asserted where wget would know by definition that the URL it is looking at is a leaf node (no further traversals would be allowed by the -l option), and in that case downloading the page is wasted effort if its name doesn't match the -A requirement.
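The decision rule being proposed could be sketched like this (hypothetical pseudologic, not code from wget; should_fetch and its arguments are made-up names):

```shell
# Sketch of the proposed optimization: a page at the maximum recursion
# depth cannot contribute new links, so if its name also fails the -A
# glob there is no reason to download it at all.
should_fetch() {
  depth=$1 maxdepth=$2 url=$3 accept_glob=$4
  case ${url##*/} in
    $accept_glob) return 0 ;;        # wanted for its own sake
  esac
  [ "$depth" -lt "$maxdepth" ]       # otherwise only worth it for its links
}

should_fetch 1 5 'http://host/misc/menu.cgi' '*.html' \
  && echo 'fetch: may still contain links to follow'
should_fetch 5 5 'http://host/misc/menu.cgi' '*.html' \
  || echo 'skip: leaf node and name rejected by -A'
```

In wget itself this test would sit inside the recursive retriever, where the current depth and the -l limit are both known before the HTTP request is issued.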