Re: Resolving of relative URL's

Jukka Zitting Mon, 12 Sep 2011 09:09:55 -0700

Hi,

On Mon, Sep 12, 2011 at 6:00 PM, Markus Jelsma
<[email protected]> wrote:
> Yes! Nutch extracts all outlinks but there is a tedious crawler trap regarding
> self-referring relative URLs. Consider http://example.org/content/ with a
> list of relative links (menu on each page) of which one or more is actually
> incorrect:
>
> ../more-content/
> ../other-content/
> wrong-link/
> ../even-more/content/
>
> For pages without a base href, wrong-link/ is resolved to
> http://example.org/content/wrong-link/. The new page also contains the same
> URL list as above, so the next wrong link is resolved as
> http://example.org/content/wrong-link/wrong-link/......
>
> An endless nightmare for a crawler :)


How would not resolving the links in Tika help in this case? To crawl
the site, the crawler would in any case have to resolve the links, and
come up with the exact same resolved URLs.
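
To illustrate, here is a minimal sketch of the resolution step using
Python's standard urllib.parse.urljoin. It is not Nutch or Tika code; the
function name follow_wrong_link and the depth parameter are made up for
this example, and it assumes each new page repeats the same wrong-link/
entry, as described above.

```python
from urllib.parse import urljoin


def follow_wrong_link(start, link="wrong-link/", depth=3):
    """Resolve the same relative link against each newly found page.

    Hypothetical helper: mimics a crawler that keeps following one
    broken menu entry that appears on every page.
    """
    urls = []
    base = start
    for _ in range(depth):
        # Standard relative reference resolution against the current page URL.
        base = urljoin(base, link)
        urls.append(base)
    return urls


for url in follow_wrong_link("http://example.org/content/"):
    print(url)
```

Whichever component performs the resolution, each step yields a URL one
path segment longer than the last, so the trap is the same either way.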

BR,

Jukka Zitting
