Re: Accept and Reject - particularly for PHP and CGI sites

Micah Cowan Mon, 10 Mar 2008 14:10:00 -0700

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Todd Pattist wrote:
> Thank you for the quick response.  Background is I'm on Windows XP, Gnu
> wget 1.11
>> This "doesn't affect traversal of HTML files" functionality is currently
>> implemented via a heuristic based on the filename extension. That is, if
>> it ends in ".htm" or ".html", I believe, then it will be traversed
>> regardless of -A or -R settings, whereas .cgi or .php will not affect
>> traversal.
>>   
> I'm not sure I understand the "cgi or .php will not affect traversal."


I mean, it will not detect these as HTML files, so the accept/reject
rules will be applied to them without exception.

> If I use wget to start at http://site.com/view.php?f=16 and recursively
> mirror without -A or -R, it looks like it  traverses deeper as though
> that page and other .php links are html files. This makes sense. (I say
> looks like, because it takes a long time and produces lots of files). 
> If I select the same page and add  accept=site.com/view.php?id=16 to
> wgetrc, no pages are saved and it does not traverse any deeper and it
> takes only a second or two.  I see this in the log:
> 
> Saving to: `site.com/[EMAIL PROTECTED]'
> Removing site.com/[EMAIL PROTECTED] since it should be rejected.
> 
> I recognize that the question mark was substituted for my OS, but that
> does not matter on the accept filter.  What does matter is whether I
> have the .html or not in the accept filter.  That surprises me.  Both
> accept=site.com/view.php?id=16.html and accept=site.com/view.php?id=16*
> will match and keep the
> site.com/[EMAIL PROTECTED] file, while both
> accept=site.com/view.php?id=16 and accept=site.com/[EMAIL PROTECTED] cause
> it not to match and generate the "Removing ... since it should be
> rejected" line.  Regardless of the matching/saving this seems to control
> traversal, as I get far deeper traversal with no accept= at all.

After another look at the relevant portions of the source code, it looks
like accept/reject rules are _always_ applied against the local
filename, contrary to what I'd been thinking. This needs to be changed.
(But it probably won't be, any time soon.)

Note that the pattern view.php?id=16 doesn't mean what you might think:
Wget treats the "?" as a wildcard that matches any single character
(including "@"). If you supplied "\?" instead (which matches a literal
question mark), I'm guessing it'd actually fail to match, because it's
checking against "@".
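
As an illustration only, the wildcard behaviour can be reproduced with
Python's fnmatch module, which implements similar shell-style globbing
(it is a stand-in, not Wget's own matching code). The local filename
below is an assumption, since the archive obscured the real one: it
supposes the "?" became "@" on Windows and that ".html" was appended.
Python's fnmatch has no backslash escape, so "[?]" plays the role of
"\?" here.

```python
from fnmatch import fnmatchcase

# Assumed local filename: "?" from the URL replaced by "@" on Windows,
# with ".html" appended (as html_extension=on would do).
local_name = "site.com/view.php@id=16.html"

patterns = [
    "site.com/view.php?id=16",         # "?" matches "@", but ".html" is left over
    "site.com/view.php?id=16*",        # "*" absorbs the ".html" suffix
    "site.com/view.php?id=16.html",    # "?" matches "@", rest matches literally
    "site.com/view.php[?]id=16.html",  # literal "?" only, so "@" does not match
]

# Map each pattern to whether it matches the whole local filename.
results = {p: fnmatchcase(local_name, p) for p in patterns}

for pattern, matched in results.items():
    print(f"{pattern!r:40} -> {matched}")
```

Under these assumptions, only the second and third patterns match,
which lines up with the behaviour reported in the quoted text.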

My understanding is that, when you specify a URL directly at the
command-line, it will be downloaded and traversed (if it turns out to be
HTML), no matter what the accept/reject rules are (which can still cause
it to be removed afterwards). Therefore, I suspect that what Wget does
with your URL when it isn't matching the accept rules is:

  1. Downloads the named file.
  2. Discovers that, regardless of the filename, it is indeed an HTML
     file, so scans it for all links to be downloaded.
  3. After scanning for all the links, it doesn't find any that end in
     ".html", nor any that match the accept rules, so it doesn't do
     anything else.

--debug will definitely tell you whether it's bothering to scan that
first file or not, and what it decides to do with the links it finds.

> I'm pretty sure  I can control traversal of php links with accept and
> reject, but I often want to traverse looking for certain file types, but
> don't want to save all the php files traversed.

We're looking for more fine-grained controls to allow this sort of
thing, but at the moment, my understanding is that there is no control
over whether Wget traverses-and-then-deletes a given file. It will
_always_ do that for files it knows or suspects are HTML (based on a
.htm or .html suffix), and for any file it downloads first anyway
because it was an explicit command-line argument, as in the example
above. It will _never_ download or traverse any other links that do
not match the accept rules.

If something _does_ match the accept rules, and turns out after download
to be an HTML file (determined by the server's headers), Wget will
traverse it further; but of course it won't delete it afterward,
because it matched the accept list.
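
The rules described above can be modelled roughly as follows. This is a
sketch of the behaviour as understood in this message, not a
transcription of Wget's source: the function names, the suffix list and
the use of Python's fnmatch for every pattern are all assumptions made
for illustration.

```python
from fnmatch import fnmatchcase

# Suffixes assumed to mark a file as "probably HTML" before download.
HTML_SUFFIXES = (".htm", ".html")


def matches_accept(local_name, accept_patterns):
    """True when there are no accept rules, or any pattern matches."""
    if not accept_patterns:
        return True
    return any(fnmatchcase(local_name, p) for p in accept_patterns)


def plan(local_name, accept_patterns, explicit_argument=False):
    """Return (download, delete_afterward) for one link.

    Hypothetical model: accepted files are kept; HTML-looking files and
    explicit command-line URLs are fetched for traversal and then
    removed if they do not match the accept rules; anything else that
    fails the accept rules is never fetched.
    """
    accepted = matches_accept(local_name, accept_patterns)
    looks_html = local_name.lower().endswith(HTML_SUFFIXES)
    download = accepted or looks_html or explicit_argument
    delete_afterward = download and not accepted
    return download, delete_afterward


if __name__ == "__main__":
    for name, explicit in [("index.html", False),
                           ("view.php@id=16", False),
                           ("view.php@id=16", True),
                           ("paper.pdf", False)]:
        print(name, explicit, plan(name, ["*.pdf"], explicit))
```

In this model a .php link that fails the accept rules is skipped
entirely unless it was named on the command line, which is the
limitation discussed in this thread.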

>> I'd have to look at the relevant code, but it's possible that
>> "directory"-looking names may also be automatically traversed in that way.
>>   
> I don't want you to do work I can do myself.  I was just hoping for a
> link or some pointers that might help.

It looks like this idea was incorrect anyway; it's only based on the suffix.

>>> Does
>>> html_extension=on affect link traversal? 
>>>     
>>
>> No; this only affects whether filenames are changed upon download to
>> explicitly include an ".html" extension (useful for local browsing).
>>   
> 
> It seems that the html extension is used in the filter matching of
> accept/reject, and that seems to affect traversal as described above
> unless I'm missing something (which is entirely possible).

Yes, it does; my bad.

>>> I'd like to be able to
>>> independently control link traversal vs. file retrieval with local file
>>> storage.  Do the directory include/exclude commands allow this - do they
>>> work differently from -A -R?
>>>     
>>
>> I'm afraid I'm unsure what you are asking here.
>>   
> Is my question clearer from the above?  I'm seeing very quick exits
> (seconds) when the accept filter does not match the start page.  To get
> deeper traversing, I have to match, but then it saves the matched files
> and the traverse takes hours, with perhaps thousands of html files
> (converted from .php files), none of which  I need. 

Yes, the question is clearer, and unfortunately the answer is "not
currently". :\


>>> 3) Which has priority if both reject and accept filters match?
>>>     
>> Not sure; it's easy enough to test this yourself, though.
>>   
> I have done lots of testing, so you'd think this simple one would be
> obvious.  The answer seems to be that reject is higher priority, since
> identical accept= and reject= seem to produce no output.  This matches
> what the manual says.

--debug is your friend. It will tell you explicitly what it thinks about
the links it finds.

> It might help to add to the manual that adding an
> accept= filter causes a rejection of everything that does not match the
> accept filter, even if there is no reject filter specified.  The fact
> that specifically accepting some files turns on a default rejection of
> everything else surprised me, since the normal default is to accept
> everything.

It actually is in the manual, but probably not in the Windows Help
documentation that you've got. My recollection is that the latter is
generated from the abbreviated reference which, on Unix, becomes the
"manpage".

The full manual is available, in various formats, at
http://www.gnu.org/software/wget/manual/.
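
The two observations above (reject takes priority over accept, and any
accept filter implicitly rejects everything that does not match it) can
be summarised in a small sketch. The function names are hypothetical and
Python's fnmatch stands in for Wget's pattern matching, so treat this as
a model of the observed behaviour rather than Wget's actual code.

```python
from fnmatch import fnmatchcase


def matches_any(name, patterns):
    """True if the name matches at least one shell-style pattern."""
    return any(fnmatchcase(name, p) for p in patterns)


def acceptable(name, accept=(), reject=()):
    """Decide whether a file is kept under the given filters."""
    # A matching reject pattern wins, even if an accept pattern also matches.
    if reject and matches_any(name, reject):
        return False
    # Once any accept pattern is given, everything else is rejected.
    if accept:
        return matches_any(name, accept)
    # With no filters at all, the default is to accept everything.
    return True


if __name__ == "__main__":
    print(acceptable("a.pdf"))
    print(acceptable("a.pdf", accept=["*.pdf"], reject=["*.pdf"]))
    print(acceptable("a.php", accept=["*.pdf"]))
```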

- --
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer...
http://micah.cowan.name/
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFH1aOL7M8hyUobTrERAiAnAJ4uk6R1jHaFEYkwScu9RKe6acGVQQCcC6lz
twxxUd2OzSjHEeSWZ/MVKOA=
=CHwi
-----END PGP SIGNATURE-----
