Todd Pattist wrote:
> I'm having trouble understanding how accept and reject work,
> particularly in the context of sites that rely on CGI and PHP to
> dynamically generate html pages.  My questions relate to the following:
> 
> 1) I don't fully understand the -A and -R effects and the difference, if
> any, between what links are traversed and parsed for deeper links,
> versus what files are kept and stored locally.  The docs seem to say
> that -A and -R have no effect on the link traverse for html files, but
> this doesn't seem true for dynamically generated CGI, PHP files.

This "doesn't affect traversal of HTML files" behavior is currently
implemented via a heuristic based on the filename extension. That is, if
the name ends in ".htm" or ".html", I believe, then it will be traversed
regardless of -A or -R settings, whereas names ending in ".cgi" or ".php"
get no such exemption: if -A/-R rejects them, they are not traversed.
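
As a rough sketch of that heuristic (this is not Wget's actual code; the
function names are made up for illustration, and the case-insensitive
comparison is my assumption), the decision looks something like this:

```python
import posixpath
from urllib.parse import urlsplit

# Extensions the heuristic treats as "HTML" (per the description above).
HTML_SUFFIXES = (".htm", ".html")


def looks_like_html(url):
    """Guess from the file name alone whether a URL is an HTML page."""
    name = posixpath.basename(urlsplit(url).path)
    # Assumption: the comparison ignores case.
    return name.lower().endswith(HTML_SUFFIXES)


def still_traversed(url, rejected_by_filters):
    """Hypothetical helper: would the link be fetched and parsed for links?

    HTML-looking names are traversed even when -A/-R would reject them;
    anything else is traversed only if the filters let it through.
    """
    return looks_like_html(url) or not rejected_by_filters


print(still_traversed("http://example.com/forum/index.html", True))
print(still_traversed("http://example.com/forum/viewtopic.php?t=1", True))
```

Note that the query string plays no part in the guess here; only the last
path component is examined.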

I'd have to look at the relevant code, but it's possible that
"directory"-looking names may also be automatically traversed in that way.

> Does
> html_extension=on affect link traversal? 

No; this only affects whether filenames are changed upon download to
explicitly include an ".html" extension (useful for local browsing).

> I'd like to be able to
> independently control link traversal vs. file retrieval with local file
> storage.  Do the directory include/exclude commands allow this - do they
> work differently from -A -R?

I'm afraid I'm unsure what you are asking here.

> 2) The logs seem to show PHP files being retrieved and then not saved.
> When mirroring a forum, you often want to exclude links that do a
> logout, or subscribe you to a topic.  Does -R prevent a dynamically
> generated html page from a PHP link from being traversed?

I think I'd need to see an example log of files "being retrieved and
then not saved", to understand what you mean.

> 3) Which has priority if both reject and accept filters match?

Not sure; it's easy enough to test this yourself, though.
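
One way to set up such a test, as a minimal sketch: point Wget at a small
site you control, with an accept pattern and a reject pattern that both
match the same file, then look at what ends up on disk. The helper names
below are hypothetical, and the URL and patterns are placeholders.

```python
import shutil
import subprocess


def build_command(url, accept, reject, dest):
    """Build a one-level recursive wget command with both filters set."""
    return [
        "wget", "-r", "-l", "1", "-nv",
        "-P", dest,      # save everything under this directory
        "-A", accept,    # accept list
        "-R", reject,    # reject list
        url,
    ]


def run_experiment(url, accept, reject, dest):
    """Run the command; inspect dest afterwards to see which filter won."""
    if shutil.which("wget") is None:
        raise RuntimeError("wget not found on PATH")
    return subprocess.run(
        build_command(url, accept, reject, dest),
        capture_output=True, text=True,
    )


# Placeholder values: a file named "secret.txt" would match both lists.
print(" ".join(build_command("http://localhost:8000/", "*.txt", "secret*", "out")))
```

Whether the doubly-matched file is present in the destination directory
afterwards answers the question for the version you are running.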

> 4) Sometimes the OS restricts filename characters.  Do the -A and -R
> filters match on the final name used to store the file, or on the name
> at the server?

They should match the server's name (which includes the
Content-Disposition name, if that's being used). However, there were at
least some situations where the local name was being matched instead
(with -nd, at least). I can't recall whether that has been resolved yet;
I'm guessing not.
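
To illustrate why the distinction matters, here is a simplified model of
-A/-R matching (an entry containing wildcard characters is treated as a
pattern, anything else as a suffix). It is only a sketch, not Wget's code,
and the local name shown is a hypothetical example of a renamed file:

```python
import fnmatch


def matches(name, entry):
    """Simplified model of a single -A/-R list entry."""
    if any(ch in entry for ch in "*?[]"):
        # Entries with wildcard characters are matched as shell patterns.
        return fnmatch.fnmatchcase(name, entry)
    # Anything else is matched as a file name suffix.
    return name.endswith(entry)


server_name = "index.html"
local_name = "index.html.1"  # hypothetical name after a local rename

print(matches(server_name, "html"), matches(local_name, "html"))
```

The same accept entry gives different answers for the two names, so
matching on the wrong one changes which files are kept.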

Please feel free to report any other cases you encounter, where local
transformations result in erroneous matches from -A/-R.

- --
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer...
http://micah.cowan.name/