> One could also choose to consider it as a bug of plucker, since even
> when following zero depth links (i.e. none), it should convert the
> "embedded" images which is kind of on level 0.

        I had this discussion long ago when Holger was maintaining the
Python spider for Plucker. The convention I've seen in dozens of papers
on spidering is that a 'maxdepth' value doesn't count the zero depth. If
you want a page, and all the images on it, that is a maxdepth of zero
(maxdepth=0). If you want to grab all links from that page, you need a
maxdepth of 1 (maxdepth=1).

        Simply put:

        "How many levels of links from this page do you want to gather?"

        "Don't grab any links from the page." (depth == 0)

        "Grab only the first level links from this page." (depth == 1)

        "Grab two levels of links from this page." (depth == 2)

        ..and so on.

        -- main page (level 0)
          |
          |---- first level links (level 1)
               |
               |---- second level links (level 2)
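
        As a rough illustration of that counting (a sketch only, not the
code of either spider; the LINKS and IMAGES tables and the gather()
function are invented for this example), the starting page is level 0 and
its embedded images come along with it, even when maxdepth is 0:

```python
# Hypothetical in-memory "site": page -> pages it links to.
LINKS = {
    "index": ["a", "b"],
    "a": ["c"],
    "b": [],
    "c": [],
}

# Hypothetical embedded images. They are not links, so they belong
# to the same level as the page that embeds them.
IMAGES = {
    "index": ["logo.gif"],
    "a": ["a.gif"],
    "c": ["c.gif"],
}


def gather(start, maxdepth):
    """Return (pages, images) within maxdepth links of start."""
    best = {}  # page -> shallowest depth at which it was reached

    def visit(page, depth):
        # Stop past the limit, or if we already reached this page
        # at the same or a shallower depth.
        if depth > maxdepth or best.get(page, depth + 1) <= depth:
            return
        best[page] = depth
        for target in LINKS.get(page, []):
            visit(target, depth + 1)

    visit(start, 0)
    pages = sorted(best, key=lambda p: (best[p], p))
    images = [img for p in pages for img in IMAGES.get(p, [])]
    return pages, images


if __name__ == "__main__":
    for depth in (0, 1, 2):
        print(depth, gather("index", depth))
```

        With maxdepth=0 this returns only "index" and its logo.gif; with
maxdepth=1 it adds "a" and "b" (and a.gif); with maxdepth=2 it also
reaches "c".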

        It's important to remember that there is no notion of 'going down
4 links' on the web. Everything on the web is at exactly a depth of 1 from
everything else; everything is one click from the next lateral thing. The
web is horizontal, not vertical, and there is no 'down' or 'up' with regard
to links. A web page can be considered the root of a tree of links, and
that tree starts at node zero (0). Going "down 4 links" actually means
moving away from the trunk of the tree (a total of 5 nodes, 0..4) towards
the end of a branch, and going "back up" is the reverse action.

        Remember also that both projects only implement a depth-first
spidering value. We haven't yet considered a 'breadth-first' value (I
started playing with this in the Perl version of the Plucker spider, using
LWP::Parallel, but it wasn't robust enough at the time to do what I
needed).
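
        For comparison, here is a minimal sketch of the two traversal
orders under the same depth limit. It is an illustration only and is not
taken from either spider: the LINKS table and both function names are
hypothetical, and the example site is a plain tree, so a simple
"already seen" check is enough.

```python
from collections import deque

# Hypothetical link tree: page -> pages it links to.
LINKS = {
    "index": ["a", "b"],
    "a": ["c"],
    "b": ["d"],
    "c": [],
    "d": [],
}


def depth_first(start, maxdepth):
    """Follow each branch to the depth limit before trying the next."""
    order = []

    def visit(page, depth):
        if depth > maxdepth or page in order:
            return
        order.append(page)
        for target in LINKS.get(page, []):
            visit(target, depth + 1)

    visit(start, 0)
    return order


def breadth_first(start, maxdepth):
    """Visit every page on one level before moving to the next level."""
    order = []
    queue = deque([(start, 0)])
    while queue:
        page, depth = queue.popleft()
        if depth > maxdepth or page in order:
            continue
        order.append(page)
        for target in LINKS.get(page, []):
            queue.append((target, depth + 1))
    return order


if __name__ == "__main__":
    print(depth_first("index", 2))
    print(breadth_first("index", 2))
```

        Both calls collect the same five pages here; only the order in
which they are fetched differs.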



/d



_______________________________________________
Sitescooper-talk mailing list
[EMAIL PROTECTED]
http://lists.sourceforge.net/lists/listinfo/sitescooper-talk
