Re: how to filter only certain URL's?

2001-02-22 Thread Dan Harkless


[EMAIL PROTECTED] (Gary Funck) writes:
 Thanks for the clarification.  The 'info' page helped clear things up:
 
 (topic: Types of Files)
 
 `-A ACCLIST'
 `--accept ACCLIST'
 `accept = ACCLIST'
  The argument to `--accept' option is a list of file suffixes or
  patterns that Wget will download during recursive retrieval.  A
  suffix is the ending part of a file, and consists of "normal"
  letters, e.g. `gif' or `.jpg'.  A matching pattern contains
  shell-like wildcards, e.g. `books*' or `zelazny*196[0-9]*'.
 
  So, specifying `wget -A gif,jpg' will make Wget download only the
  files ending with `gif' or `jpg', i.e. GIFs and JPEGs.  On the
  other hand, `wget -A "zelazny*196[0-9]*"' will download only files
  beginning with `zelazny' and containing numbers from 1960 to 1969
  anywhere within.  Look up the manual of your shell for a
  description of how pattern matching works.
 
 but the man page that I checked first, didn't talk about the pattern
 matching capabilities:

Wget hasn't been distributed with an official man page for some time.  Linux
vendors often include an old one.  Starting in 1.7, an official manpage
(autogenerated from the .texi) will be included again.
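For illustration, a minimal, self-contained C sketch of roughly how that
suffix-versus-wildcard behavior works.  It uses the standard fnmatch() for
the wildcard case and a plain suffix comparison otherwise; this only
approximates the documented -A/--accept semantics and is not Wget's actual
code:

  #include <stdio.h>
  #include <string.h>
  #include <fnmatch.h>

  /* Roughly the documented -A behavior: treat PAT as a shell-style
     pattern if it contains wildcards, otherwise as a plain suffix. */
  static int accept_match (const char *pat, const char *file)
  {
    if (strpbrk (pat, "*?[") != NULL)
      return fnmatch (pat, file, 0) == 0;
    else
      {
        size_t pl = strlen (pat), fl = strlen (file);
        return fl >= pl && strcmp (file + fl - pl, pat) == 0;
      }
  }

  int main (void)
  {
    printf ("%d\n", accept_match ("jpg", "photo.jpg"));                       /* 1 */
    printf ("%d\n", accept_match ("zelazny*196[0-9]*", "zelazny_1967.html")); /* 1 */
    printf ("%d\n", accept_match ("gif", "photo.jpg"));                       /* 0 */
    return 0;
  }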

 In my application, I wanted to apply the pattern to the *entire* URL,
 inclusive of intervening directories.  I made a small modification to
 the source code to implement full-URL pattern matching:
 
 *** utils.c.orig	Sun Jun 25 23:11:44 2000
 --- utils.c	Mon Feb 19 20:00:14 2001
 ***************
 *** 546,557 ****
 --- 546,559 ----
   int
   acceptable (const char *s)
   {
 + #ifdef MATCH_ONLY_LAST_PART_OF_URL
 int l = strlen (s);
   
 while (l && s[l] != '/')
   --l;
 if (s[l] == '/')
   s += (l + 1);
 + #endif
 if (opt.accepts)
   {
 if (opt.rejects)
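For clarity, a standalone sketch of what that #ifdef block controls: with
MATCH_ONLY_LAST_PART_OF_URL defined, acceptable() sees only the part of the
URL after the rightmost '/'; with the block left out (the change above), the
full path is matched.  The sketch reuses the stripping loop from the patch
and is illustration only:

  #include <stdio.h>
  #include <string.h>

  /* Same stripping loop as in the patch: skip everything up to and
     including the last '/'. */
  static const char *last_component (const char *s)
  {
    int l = strlen (s);
    while (l && s[l] != '/')
      --l;
    if (s[l] == '/')
      s += (l + 1);
    return s;
  }

  int main (void)
  {
    const char *url = "faqs/linux-howto.html";
    /* With MATCH_ONLY_LAST_PART_OF_URL, -A/-R patterns are tested
       against only this part: */
    printf ("%s\n", last_component (url));   /* prints "linux-howto.html" */
    /* Without it, they are tested against the full "faqs/..." path. */
    return 0;
  }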

I think we'll have to hold off on making a change like this until we
determine whether we'll be adding full regexp matching ability or not.

 Another suggestion: if the user supplies either --accept or --reject
 options, it is possible that the resulting mirrored file hierarchy
 will contain empty directories.  Although a shell script could
 easily clean them up, as in:
find . -type d -exec rmdir --ignore-fail-on-non-empty {} \;
 it would be nice if wget took that action as a final clean up.

Okay, I'll add that to the TODO.

---
Dan Harkless            | To help prevent SPAM contamination,
GNU Wget co-maintainer  | please do not mention this email
http://sunsite.dk/wget/ | address in Usenet posts -- thank you.



Re: how to filter only certain URL's?

2001-02-19 Thread Dan Harkless

[EMAIL PROTECTED] (Gary Funck) writes:
 Hello,
 
 I have an application where I want to traverse a given site, but only
 retrieve pages with a URL that matches a particular pattern.  The
 pattern would include a specific directory, and a file name that
 has a particular form.  If wget won't accept a general pattern, I'd
 like it if wget would just return the URL's it finds during its
 recursive traversal, but not return the data.  Given the list of
 URL's, I can pick out the ones that I'm interested in, and fetch
 only those.  Here's an example - assume that I'm interested in
 fetching all FAQ pages that have linux in their file name.  Using
 conventional grep patterns, I might be interested in URL's of the
 form: '.*/faqs/.*linux.*\.html', for example.  Is there a way to
 do something like this in wget, or some other program?
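For illustration, a minimal C sketch of the post-filtering step described
above, using the POSIX regex API to pick matching URLs out of a harvested
list; the URLs below are made up:

  #include <stdio.h>
  #include <regex.h>

  int main (void)
  {
    /* Hypothetical URLs harvested from a recursive traversal. */
    const char *urls[] = {
      "http://example.org/faqs/linux-net-faq.html",
      "http://example.org/faqs/bsd-faq.html",
      "http://example.org/docs/linux.txt",
    };
    regex_t re;
    /* The pattern from the message above. */
    if (regcomp (&re, ".*/faqs/.*linux.*\\.html", REG_EXTENDED | REG_NOSUB) != 0)
      return 1;
    for (size_t i = 0; i < sizeof urls / sizeof urls[0]; i++)
      if (regexec (&re, urls[i], 0, NULL, 0) == 0)
        printf ("fetch: %s\n", urls[i]);      /* only the first URL matches */
    regfree (&re);
    return 0;
  }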

Dunno about other programs, but you could try Soenke Peters' wget-1.5.3gold,
which has the Perl regular expression library integrated for use with
--accept and --reject (see attached mail).

At this time, no opinion has been expressed as to whether the mainline Wget
should include this change.

---
Dan Harkless            | To help prevent SPAM contamination,
GNU Wget co-maintainer  | please do not mention this email
http://sunsite.dk/wget/ | address in Usenet posts -- thank you.





Hi,

I included the Perl regular expressions (PCRE) library from
  ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre/
to get better filename matching capabilities for "--accept" and
"--reject" options.

For now, it is a compile-time option.
You must specify "--with-pcre" as an option for the configure script and
have pcre >= 3.0 installed (RedHat and SuSE RPMs are available - use
http://ftpsearch.ntnu.no ).
I just uploaded pcre-3.1-47 to RedHat; it should be available on every
http://contrib.redhat.com mirror in the next few days.

If you include this feature, any Perl-ish regex can be used with the
"--accept" and "--reject" options; the old style isn't supported any
more.
Examples:
 -A 'HTML,HTM,JPG,JPEG' is now: -A 'HTM.?$|JP.?G$'
Or when mirroring Apache servers:
 -R '^\?.=[AD]'

This may seem quite complicated to those who have never used Perl, but
it is powerful as hell.
BTW: Matching is done case-independently.
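For those who have not used PCRE before, here is a minimal sketch of the
kind of check the patch performs -- compile the --accept expression once
with PCRE_CASELESS and test each candidate name against it.  This uses the
classic pcre_compile()/pcre_exec() API from the library above; it is not the
patch's actual code:

  #include <stdio.h>
  #include <string.h>
  #include <pcre.h>

  int main (void)
  {
    const char *error;
    int erroffset, ovector[30];
    /* The -A example from above: accept .htm/.html and .jpg/.jpeg names. */
    pcre *re = pcre_compile ("HTM.?$|JP.?G$", PCRE_CASELESS,
                             &error, &erroffset, NULL);
    if (!re)
      {
        fprintf (stderr, "pcre_compile failed at %d: %s\n", erroffset, error);
        return 1;
      }
    const char *name = "index.html";
    int rc = pcre_exec (re, NULL, name, (int) strlen (name),
                        0, 0, ovector, 30);
    printf ("%s: %s\n", name, rc >= 0 ? "accepted" : "rejected");
    pcre_free (re);
    return 0;
  }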

Attached is a patch against my last "gold" version.

The most recent package for now is
  http://www.simprovement.com/linux/files/wget-1.5.3gold2.tar.gz
Or try
  http://www.simprovement.com/linux/frech.cgi

My "gold" version includes some patches from this list (e.g. preliminary
HTTPS support) and my own extensions.

Please send me any patches/bugfixes you make!

Bye,
-- 
     Soenke Jan  Peters
 |_  _|   Rostock, Germany   PGP on request,
 ._|| [EMAIL PROTECTED]   Subject: getpgpkey
 http://www.simprovement.com   No solicitations!

diff -urN --exclude=configure wget-1.5.3gold/ChangeLog wget-1.5.3gold2/ChangeLog
--- wget-1.5.3gold/ChangeLog	Wed May 24 22:29:18 2000
+++ wget-1.5.3gold2/ChangeLog   Fri Jun 30 00:39:45 2000
@@ -1,3 +1,15 @@
+2000-06-29  Soenke J. Peters [EMAIL PROTECTED]
+   *  PCRE library option at compile time
+   for --accept and --reject options
+
+   * HTTPS support via OpenSSL
+
+   * Alternative print_percentage() function
+
+   * --filter-script option
+
+   * Referrer faking via --referer=fake
+
 2000-05-24  Dan Harkless  [EMAIL PROTECTED]
 
* TODO: Timestamps sometimes not copied over on files retrieved by FTP.
diff -urN --exclude=configure wget-1.5.3gold/Makefile.in wget-1.5.3gold2/Makefile.in
--- wget-1.5.3gold/Makefile.in  Mon Jun 26 07:10:44 2000
+++ wget-1.5.3gold2/Makefile.in Fri Jun 30 00:23:19 2000
@@ -47,7 +47,8 @@
 CPPFLAGS = @CPPFLAGS@ -I$(SSL_INCDIR)
 DEFS = @DEFS@ -DSYSTEM_WGETRC=\"$(sysconfdir)/wgetrc\" -DLOCALEDIR=\"$(localedir)\"
 SSL_LIBS = @SSL_LIBS@
-LIBS = @LIBS@ $(SSL_LIBS)
+PCRE_LIBS = @PCRE_LIBS@
+LIBS = @LIBS@ $(SSL_LIBS) $(PCRE_LIBS)
 SSL_LIBDIR = @SSL_LIBDIR@
 LDFLAGS = @LDFLAGS@
 
diff -urN --exclude=configure wget-1.5.3gold/NEWS wget-1.5.3gold2/NEWS
--- wget-1.5.3gold/NEWS Mon Jun 26 10:59:15 2000
+++ wget-1.5.3gold2/NEWS	Fri Jun 30 00:38:00 2000
@@ -5,6 +5,11 @@
 
 Please send GNU Wget bug reports to [EMAIL PROTECTED].
 
+* Changes in Wget 1.5.3gold2
+
+** PCRE library option at compile time for --accept and --reject options
+
+
 * Changes in Wget 1.5.3gold
 
 ** HTTPS support via OpenSSL
diff -urN --exclude=configure wget-1.5.3gold/configure.in wget-1.5.3gold2/configure.in
--- wget-1.5.3gold/configure.in Mon Jun 26 05:28:27 2000
+++ wget-1.5.3gold2/configure.in	Fri Jun 30 00:54:03 2000
@@ -61,6 +61,20 @@
 [  --with-ssl-libdir=DIR   where to find SSLeay library files (optional)],
 ssl_libdir="$withval", ssl_libdir="/usr/lib")
 
+AC_ARG_WITH(pcre,
+[  --with-pcre compile in PCRE support],
+[if test "$withval" != "no" ; then
+  

Re: how to filter only certain URL's?

2001-02-19 Thread Gary Funck

On Feb 19,  1:51pm, Dan Harkless wrote:
 
 Dunno about other programs, but you could try Soenke Peters' wget-1.5.3gold,
 which has the Perl regular expression library integrated for use with
 --accept and --reject (see attached mail).
 
 At this time, no opinion has been expressed as to whether the mainline Wget
 should include this change.

Dan, thanks.  Looks like that would be just the ticket; however,
at the present time, this URL (and the top-level web site) comes
up empty:
   http://www.simprovement.com/linux/files/wget-1.5.3gold2.tar.gz

 Date: Fri, 30 Jun 2000 01:43:15 +0200
 From: "Soenke J. Peters" [EMAIL PROTECTED]
[...]
 
 If you include this feature, any Perl-ish regex can be used with the
 "--accept" and "--reject" options; the old style isn't supported any
 more.
 Examples:
  -A 'HTML,HTM,JPG,JPEG' is now: -A 'HTM.?$|JP.?G$'
 Or when mirroring Apache servers:
  -R '^\?.=[AD]'
 
 This may seem quite complicated to those who have never used Perl, but
 it is powerful as hell.
 BTW: Matching is done case-independently.
 
 Attached is a patch against my last "gold" version.
 
 The most recent package for now is
   http://www.simprovement.com/linux/files/wget-1.5.3gold2.tar.gz
 Or try
   http://www.simprovement.com/linux/frech.cgi
 
 My "gold" version includes some patches from this list (e.g. preliminary
 HTTPS support) and my own extensions.

Would you happen to have another URL/source for this version of wget?
(I checked a few of the wget sites, but didn't find it.)  Thanks.



Re: how to filter only certain URL's?

2001-02-19 Thread [EMAIL PROTECTED]

On Mon, 19 Feb 2001, Dan Harkless wrote:

 Dunno about other programs, but you could try Soenke Peters' wget-1.5.3gold,
 which has the Perl regular expression library integrated for use with
 --accept and --reject (see attached mail).

 At this time, no opinion has been expressed as to whether the mainline Wget
 should include this change.

Being a [useful-]feature junkie when it comes to wget, I feel this could be
included without making the program bloated in any way. Regular
expressions are almost always a good thing, and they would serve wget's
purpose just great if the patch works flawlessly as described (I haven't
had time to patch and test it myself), since Perl is the grand master when
it comes to regular expressions. With a brief introduction to regular
expressions in the docs/info, I think more people would get fewer files
they don't want when they mirror a page or an FTP site (making less traffic
on the network = good), and at the same time you could have _very_ complex
mirrors.

Replacing the current -A and -R options might not be such a good idea,
though, since older scripts that depend on wget's -A/-R, and people who use
them regularly now, would be confused if they just stopped working.
Backward compatibility is always good, but then again it's a pain when it
restrains progress - and I personally have no clue how this could be
implemented right now without letting go of the old -A/-R style, short of
creating brand new options of course. Then again, having options for _both_
Perl regular expressions _and_ the old style sounds like an even worse
idea.

(I'm so happy I'm just a user and not a developer! :-) )


Best regards
 Henrik van Ginhoven, Sweden
 9799-5
 Everyone a story but they end the same




Re: how to filter only certain URL's?

2001-02-19 Thread Gary Funck

On Feb 19,  3:09pm, Gary Funck wrote:
 
 Dan, thanks.  Looks like that would be just the ticket; however,
 at the present time, this URL (and the top-level web site) comes
 up empty:
http://www.simprovement.com/linux/files/wget-1.5.3gold2.tar.gz
 
 Would you happen to have another URL/source for this version of wget?
 (I checked a few of the wget sites, but didn't find it.)  Thanks.

Using Google, I found this source RPM.  It's a little out of the way for
US users, but here it is nevertheless:
ftp://ftp.lip6.fr/pub/linux/distributions/redhat-contrib/libc6/SRPMS/wget-1.5.3gold-1.src.rpm





Re: how to filter only certain URL's?

2001-02-19 Thread Dan Harkless


"[EMAIL PROTECTED]" [EMAIL PROTECTED] writes:
 Being a [useful-]feature junkie when it comes to wget, I feel this could be
 included without making the program bloated in any way. Regular
 expressions are almost always a good thing, and they would serve wget's
 purpose just great if the patch works flawlessly as described (I haven't
 had time to patch and test it myself), since Perl is the grand master when
 it comes to regular expressions. With a brief introduction to regular
 expressions in the docs/info, I think more people would get fewer files
 they don't want when they mirror a page or an FTP site (making less traffic
 on the network = good), and at the same time you could have _very_ complex
 mirrors.
 
 Replacing the current -A and -R options might not be such a good idea,
 though, since older scripts that depend on wget's -A/-R, and people who use
 them regularly now, would be confused if they just stopped working.
 Backward compatibility is always good, but then again it's a pain when it
 restrains progress - and I personally have no clue how this could be
 implemented right now without letting go of the old -A/-R style, short of
 creating brand new options of course. Then again, having options for _both_
 Perl regular expressions _and_ the old style sounds like an even worse
 idea.

Yes, the fact that you would now have to understand regexps to use -A and -R
is one reason that patch didn't get immediately thrown into the main Wget.
One way to go might be to have a --regexp option which turns on regexp
support in all options that have it (probably starting with just -A and -R).
Those that wanted to could turn the option on all the time in their .wgetrc.
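A hypothetical sketch of what such a toggle could look like around the
accept check -- the use_regexp flag (set by the proposed --regexp option or
a .wgetrc line) and the fallback to shell-style matching are invented here
for illustration and are not existing Wget code:

  #include <stdio.h>
  #include <regex.h>
  #include <fnmatch.h>

  /* Hypothetical: in Wget this would live in struct options and be set
     by a --regexp switch or "regexp = on" in .wgetrc. */
  static int use_regexp = 0;

  static int pattern_match (const char *pattern, const char *name)
  {
    if (use_regexp)
      {
        regex_t re;
        if (regcomp (&re, pattern, REG_EXTENDED | REG_NOSUB | REG_ICASE) != 0)
          return 0;
        int ok = regexec (&re, name, 0, NULL, 0) == 0;
        regfree (&re);
        return ok;
      }
    /* Default: the old shell-style wildcard behavior. */
    return fnmatch (pattern, name, 0) == 0;
  }

  int main (void)
  {
    printf ("%d\n", pattern_match ("*.html", "faq.html"));              /* 1 */
    use_regexp = 1;
    printf ("%d\n", pattern_match ("faq.*\\.html$", "faq-linux.html")); /* 1 */
    return 0;
  }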

Also we'd have to make sure there are no problems bundling PCRE (or another
regexp library, if necessary) with Wget.

---
Dan Harkless            | To help prevent SPAM contamination,
GNU Wget co-maintainer  | please do not mention this email
http://sunsite.dk/wget/ | address in Usenet posts -- thank you.



Re: how to filter only certain URL's?

2001-02-19 Thread Dan Harkless


[EMAIL PROTECTED] (Gary Funck) writes:
 Dan, sorry to trouble you

Please post to [EMAIL PROTECTED] rather than putting the onus on one person to
answer you (and depriving everyone else of the information).

 - but that RPM URL that I mentioned
 appeared to be a version that has pattern matching in it, but now it
 appears that this version has some sort of shell-like globbing,
 but doesn't have the regex stuff.  I actually would prefer the
 regex version for what I'm trying to do, and there's no documentation
 on how the globbing works, how much of a pathname I can use it
 on, etc.  (i.e., does it only match the part of the URL after the
 rightmost slash?)
 
 If you can point me at a copy of the regex version, that'd be
 great.  thanks, - Gary

The sum of my knowledge on 1.5.3gold is what I read in the author's post.
You could try emailing him.  Also, perhaps the patch in the email I
forwarded successfully applies to the RPM'd version and converts it to
regexp.

---
Dan Harkless            | To help prevent SPAM contamination,
GNU Wget co-maintainer  | please do not mention this email
http://sunsite.dk/wget/ | address in Usenet posts -- thank you.



Re: how to filter only certain URL's?

2001-02-19 Thread Gary Funck

On Feb 19,  5:17pm, Dan Harkless wrote:
 Subject: Re: how to filter only certain URL's?
 
 [EMAIL PROTECTED] (Gary Funck) writes:
  Dan, sorry to trouble you
 
 Please post to [EMAIL PROTECTED] rather than putting the onus on one person to
 answer you (and depriving everyone else of the information).

Okay. no prob.

 
  - but that RPM URL that I mentioned
  appeared to be a version that has pattern matching in it, but now it
  appears that this version has some sort of shell-like globbing,
  but doesn't have the regex stuff.  I actually would prefer the
  regex version for what I'm trying to do, and there's no documentation
  on how the globbing works, how much of a pathname I can use it
  on, etc.  (i.e., does it only match the part of the URL after the
  rightmost slash?)

I experimented with the above-mentioned "globbing", but couldn't
figure out how it works (though admittedly I didn't try firing up
the debugger to see what's going on).

One thing that the matching did appear to be doing, however, is
first *downloading the entire page* before making the decision
as to whether to keep the page or not.  This is decidedly not the
preferred implementation -- it wastes bandwidth.

  
  If you can point me at a copy of the regex version, that'd be
  great.  thanks, - Gary
 
 The sum of my knowledge on 1.5.3gold is what I read in the author's post.
 You could try emailing him.  Also, perhaps the patch in the email I
 forwarded successfully applies to the RPM'd version and converts it to
 regexp.

Thanks - I didn't notice the patchset attachment when I first read
your e-mail.  I'll give it a shot.



Re: how to filter only certain URL's?

2001-02-19 Thread Gary Funck

On Feb 19,  5:55pm, Gary Funck wrote:
 
 One thing that the matching did appear to be doing, however, is
 first *downloading the entire page* before making the decision
 as to whether to keep the page or not.  This is decidedly not the
 preferred implementation -- it wastes bandwidth.
 

Well, wget does have to look into each page in order to
perform its recursive traversal, so it downloads the page first.
However, there are cases, when the -l (max. number of levels) option
is given, where wget would know by definition that the URL it
is looking at is a leaf node (no further traversals would be allowed
by the -l option); in that case, downloading the page is wasted
effort if its name doesn't match the -A requirement.
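
A small sketch of the shortcut being suggested here, with invented names
standing in for whatever Wget uses internally -- a link that sits at the -l
depth limit cannot be descended into, so if its name already fails the -A
test there is no need to download it:

  #include <stdio.h>
  #include <fnmatch.h>

  /* Stand-in for the -A test; the pattern is illustrative only. */
  static int url_acceptable (const char *url)
  {
    return fnmatch ("*linux*.html", url, 0) == 0;
  }

  /* A link at the -l limit is a leaf: if it already fails the accept
     test, skip the transfer entirely. */
  static int should_download (const char *url, int depth, int max_depth)
  {
    int is_leaf = (max_depth != 0 && depth >= max_depth);
    if (is_leaf && !url_acceptable (url))
      return 0;
    return 1;
  }

  int main (void)
  {
    printf ("%d\n", should_download ("faqs/linux-faq.html", 2, 2));  /* 1 */
    printf ("%d\n", should_download ("faqs/bsd-faq.html", 2, 2));    /* 0 */
    printf ("%d\n", should_download ("faqs/bsd-faq.html", 1, 2));    /* 1: may hold links */
    return 0;
  }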