A path-ascending crawler is one that, when given the URL http://foo.org/a/b/page.html, will attempt to crawl each of the following in turn:

http://foo.org/a/b/page.html
http://foo.org/a/b/
http://foo.org/a/
http://foo.org/

Ascending the path this way helps the crawler find resources that no other resource links to, which gives a more complete picture of what a web server actually contains. See "Web-Crawling Reliability" by Viv Cothey (2004) for more info.
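
To make the ascent concrete, here is a rough Python sketch of how the list of URLs above could be generated. This is only an illustration, not wget code: the function name ascending_urls is made up, and I am assuming the query string and fragment are simply dropped.

```python
from urllib.parse import urlsplit, urlunsplit


def ascending_urls(url):
    """Return the URL followed by each ancestor directory, up to the site root.

    Hypothetical helper for illustration; query and fragment are discarded.
    """
    parts = urlsplit(url)
    path = parts.path or "/"  # treat "http://foo.org" like "http://foo.org/"
    urls = [urlunsplit((parts.scheme, parts.netloc, path, "", ""))]

    # Strip one path segment per iteration until only the root remains.
    while path != "/":
        path = path.rstrip("/")                      # "/a/b/" -> "/a/b"
        path = path[: path.rfind("/") + 1] or "/"    # "/a/b"  -> "/a/"
        urls.append(urlunsplit((parts.scheme, parts.netloc, path, "", "")))
    return urls


if __name__ == "__main__":
    for u in ascending_urls("http://foo.org/a/b/page.html"):
        print(u)
```

A real crawler would also want to remember which ancestors it has already fetched, so that two pages in the same directory do not trigger the same requests twice.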

It would be nice to have this functionality in wget. Something like:

wget -r --path-ascend http://foo.org/

What do you guys think?

Frank
