Hi,

On Mon, Sep 12, 2011 at 6:00 PM, Markus Jelsma <[email protected]> wrote:
> Yes! Nutch extracts all outlinks but there is a tedious crawler trap regarding
> to self-referring relative URL's. Consider http://example.org/content/ with a
> list of relative links (menu on each page) of which one or more is actually
> incorrect:
>
> ../more-content/
> ../other-content/
> wrong-link/
> ../even-more/content/
>
> For pages without base href the wrong-link/ is resolved to
> http://example.org/content/wrong-link/. The new page also contains the same
> url list as above so the next wrong link is resolved as
> http://example.org/content/wrong-link/wrong-link/......
>
> An endless nightmare for a crawler :)

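To make the quoted scenario concrete, here is a minimal sketch of the resolution step. It uses Python's standard urllib.parse.urljoin, not Nutch or Tika code; the helper names resolve_menu and follow_wrong_link are made up for illustration, and the menu is the one from the quoted example. It only shows that standard relative-URL resolution produces an ever-growing path when the broken link is followed.

```python
from urllib.parse import urljoin

# Relative links from the quoted menu; "wrong-link/" is the broken one.
MENU = ["../more-content/", "../other-content/", "wrong-link/", "../even-more/content/"]


def resolve_menu(page_url):
    """Resolve each menu link against the page URL (no base href assumed)."""
    return [urljoin(page_url, href) for href in MENU]


def follow_wrong_link(start_url, hops):
    """Follow the broken relative link `hops` times, returning each URL reached."""
    urls = []
    current = start_url
    for _ in range(hops):
        # Each new page carries the same menu, so the same relative
        # link is resolved against an ever-deeper base URL.
        current = urljoin(current, "wrong-link/")
        urls.append(current)
    return urls


if __name__ == "__main__":
    for url in follow_wrong_link("http://example.org/content/", 3):
        print(url)
```

Any resolver that follows the standard rules should arrive at the same URLs, whichever component performs the resolution.
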
How would not resolving the links in Tika help in this case? To crawl the site,
the crawler would in any case have to resolve the links, and come up with the
exact same resolved URLs.

BR,

Jukka Zitting
