Jukka and others,

There are now several cases known to us where we would like to control URL resolving. All cases share one similarity: the URLs being relative in the original source. How could we instruct the parser, or modify the code, to do so?
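To make the trap from the quoted thread concrete, here is a minimal sketch of how standard RFC 3986 resolution compounds the bad link. This uses plain `java.net.URI` rather than Nutch's own resolver, and the class name `RelativeTrap` is just for illustration:

```java
import java.net.URI;

public class RelativeTrap {
    // Resolve a relative link against a base URL, RFC 3986 style,
    // which is what any crawler effectively does per outlink.
    static String resolve(String base, String link) {
        return URI.create(base).resolve(link).toString();
    }

    public static void main(String[] args) {
        String base = "http://example.org/content/";

        // The incorrect relative link from the page's menu.
        String first = resolve(base, "wrong-link/");
        System.out.println(first);
        // http://example.org/content/wrong-link/

        // The new page carries the same menu, so the same relative
        // link resolves one level deeper on every hop.
        String second = resolve(first, "wrong-link/");
        System.out.println(second);
        // http://example.org/content/wrong-link/wrong-link/
    }
}
```

Each hop yields a new, previously unseen URL, which is why the crawl never terminates without some external filter.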
Right now we need to come up with regular expressions to detect commonalities in URI segments and throw them away.

Thanks

> Hi,
>
> On Mon, Sep 12, 2011 at 6:00 PM, Markus Jelsma
> <[email protected]> wrote:
> > Yes! Nutch extracts all outlinks, but there is a tedious crawler trap
> > regarding self-referring relative URLs. Consider
> > http://example.org/content/ with a list of relative links (a menu on each
> > page) of which one or more is actually incorrect:
> >
> > ../more-content/
> > ../other-content/
> > wrong-link/
> > ../even-more/content/
> >
> > For pages without a base href, wrong-link/ is resolved to
> > http://example.org/content/wrong-link/. The new page also contains the
> > same URL list as above, so the next wrong link is resolved as
> > http://example.org/content/wrong-link/wrong-link/...
> >
> > An endless nightmare for a crawler :)
>
> How would not resolving the links in Tika help in this case? To crawl
> the site, the crawler would in any case have to resolve the links, and
> come up with the exact same resolved URLs.
>
> BR,
>
> Jukka Zitting
