Re: Accept and Reject - particularly for PHP and CGI sites

Micah Cowan Wed, 19 Mar 2008 13:06:59 -0700

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Todd Pattist wrote:
> Micah Cowan wrote:
>> Well, -E is special, true. But in general the second quote is (by
>> definition) correct.
>>
>> -E, obviously, _shouldn't_ be special...
> 
> I hope it's clear I'm not complaining.


I didn't take it as complaining.

>>> I haven't yet quite figured out file extension matching versus string
>>> matching in filenames, but extensions seem to match regardless of
>>> leading characters or following ?id=1 parameters.
>>
>> That's right; the "query" portion of the URL is not used to determine
>> matching. There are, of course, times when you specifically wish to tell
>> wget not to follow certain specific query strings (such as edit or print
>> or... in wikis); wget doesn't currently support this (I plan to fix
>> this).
> 
> Now I'm confused again.  I suppose I can go through more trial and error
> or dig through the source to figure out what it's really doing, but in
> hopes you can throw more light on this, I'll explicate what is confusing
> me. (comments relate to wget 1.11 running on Windows XP)
> 
> Confusion 1:  Right now, I'm only using file extensions in the accept=
> parameters, such as  accept=zip,jpg,gif,php  etc.  Even if the query
> portion (the "?id=1" part of site.com/index.php?id=1) is not considered
> during matching, it's not clear to me why accept=php matches
> "site.com/index.php".  Why don't I need *.php (Windows) or *php
> (assuming the *glob matches the period).  Would "accept=index" match
> "index.php?id=1"? How about "accept=*index*"

(This is in the documentation; at least the full documentation. See the
manual on the website; I think the Windows Help files that ship with
Wget are based on a "short" version of the manual).

The way the matching works is that, if the pattern contains any wildcard
characters (any of '*', '?', '[' or ']'), then it is treated as a
wildcard pattern and matched against the whole filename; otherwise, it's
matched exactly against the filename suffix (not necessarily the
extension). "php" will match "index.php", or even "shophp", but not
"index.php.foo". "*.php" wouldn't match "shophp", since the period is
right there in the pattern. So "index" wouldn't match "index.php", but
"*index*" would.
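
To make the rule concrete, here is a minimal Python sketch of it. This is
not Wget's actual C code: the helper name `matches` is made up for this
example, and Python's fnmatch module is assumed to behave closely enough
to the C library's wildcard matching for simple patterns like these.

```python
from fnmatch import fnmatchcase

# Characters that turn an accept/reject entry into a wildcard pattern.
WILDCARD_CHARS = "*?[]"


def matches(filename, pattern):
    """Hypothetical helper mimicking the accept/reject rule described above."""
    if any(ch in pattern for ch in WILDCARD_CHARS):
        # Wildcard pattern: it has to match the whole filename.
        return fnmatchcase(filename, pattern)
    # No wildcards: plain suffix comparison, not limited to the extension.
    return filename.endswith(pattern)


if __name__ == "__main__":
    cases = [
        ("index.php", "php"),
        ("shophp", "php"),
        ("index.php.foo", "php"),
        ("shophp", "*.php"),
        ("index.php", "index"),
        ("index.php", "*index*"),
    ]
    for name, pattern in cases:
        print(f"{pattern!r} vs {name!r}: {matches(name, pattern)}")
```

Running it prints one line per pair, which makes it easy to try out other
accept= or reject= entries before handing them to wget.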

This is only ever matched against the filename, and never the domain,
directory, or query string (actually, as you've discovered, it's matched
against the _local_ filename for some cases, which needs to be fixed).

As I currently understand it from the code, at least for Wget 1.11,
matching is against the _URL_'s filename portion (and only that portion:
no query strings, no directories) when deciding whether it should
download something through a recursive descent (the relevant spot in the
code is in recur.c, marked by a comment starting "6. Check for
acceptance/rejection rules.").

When deciding whether it should delete a file afterwards, however, it
uses the _local_ filename (relevant code also in recur.c, near "Either
--delete-after was specified,"). I'm not positive, but this probably
means query strings _do_ matter in that case. :p
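
A rough Python sketch of the difference between the two names, under some
assumptions: the helper names `url_filename` and `local_filename` are
invented for this example, and the local name is built the way Todd
describes it on Windows (query string kept, with '@' replacing the '?').
Wget's real local-name construction does more than this.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit
import posixpath


def url_filename(url):
    """Filename portion of the URL only: no directories, no query string."""
    return posixpath.basename(urlsplit(url).path)


def local_filename(url, query_sep="@"):
    """Assumed local name: the filename plus the query string, joined by
    query_sep (hypothetical parameter; '@' is what Todd reports seeing)."""
    parts = urlsplit(url)
    name = posixpath.basename(parts.path)
    return name + query_sep + parts.query if parts.query else name


if __name__ == "__main__":
    url = "http://site.com/index.php?mode=logout"
    before = url_filename(url)    # name checked before retrieval
    after = local_filename(url)   # name checked before deletion
    print(before, fnmatchcase(before, "*logout*"))
    print(after, fnmatchcase(after, "*logout*"))
```

If this picture is right, a pattern like *logout* can only ever hit on the
second check, because the first one never sees the query string.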

Confused? Coz I sure am!

> I assumed I could do an
> accept match on the query portion, the filename portion, or even the
> domain, but I suspect now that's wrong.  The domain gets stripped off
> when the local name is constructed, so I realize now I can't match on
> that (local filename used for matching), but the query portion is
> usually left as part of the filename, with an atsign replacing the
> question mark.  Is filename matching allowed or only extension matching?

Well... there are _separate_ options for accepting/rejecting domain names
(-D/--domains and --exclude-domains; these require -H to be meaningful,
since by default Wget only allows hosts you've explicitly requested, plus
any that result from redirections).

> Confusion 2: I'm rejecting based on the query string, usually after an
> accept string allowing defined extensions.  I think I understand this,
> and I think it's working fine.  I'm usually doing something like
> reject=*logout*,*subscribe=*,*watch=* to prevent traversal of logout
> links or thread subscription links in a phpbb setting.  This works.  I
> think it's doing exactly what you say it's not yet capable of doing, but
> maybe I'm missing something.  Does the accept matching work differently
> from the reject matching?

They use _exactly_ the same code.

> Does reject work on the URL before retrieval,
> but accept work on the local filename after retrieval?  If the
> site.com/index.php?mode=logout link was being traversed with
> accept=php and reject=*logout*, I would be getting logged out, but I'm not.

What site is it? You might run wget with --debug to find out _exactly_
why it doesn't traverse these (see
http://wget.addictivecode.org/FrequentlyAskedQuestions#not-downloading
for an enumeration of various messages Wget uses to say why something
isn't downloaded). Some sites are intelligent enough to include a
"rel=nofollow" or "nofollow" attribute in their anchor tags, which Wget
will obey unless -e robots=off was specified. The MoinMoin wiki
software (which is what the Wget Wiki runs on), for instance, will do
this.

> Hmmmmm..... light perhaps begins to dawn.  It looks like both accept and
> reject are applied twice - once before retrieval and once after.

Yup!

> To be retrieved/traversed it has to pass both filters and then after local
> renaming, it has to pass both again.  That would fit what I'm seeing. My
> reject filter prevents traversing logout links during the first pass and
> my accept filter deletes php files during the second check after html
> renaming.

I think your reject filter probably isn't what's preventing the
traversal; more likely it's being prevented by other means.

-- 
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer...
http://micah.cowan.name/
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFH4XJF7M8hyUobTrERAk3rAJ4jkKYAE7k6E3oXOxg6rzQ6UPlYFACghwxM
Dz7TBOfdSxjh8LsbVCl0sWU=
=lc4f
-----END PGP SIGNATURE-----
