On Monday 12 September 2011 18:08:50 Jukka Zitting wrote:
> > For pages without base href the wrong-link/ is resolved to
> > http://example.org/content/wrong-link/. The new page also contains the
> > same url list as above so the next wrong link is resolved as
> > http://example.org/content/wrong-link/wrong-link/......
> >
> > An endless nightmare for a crawler :)
>
> How would not resolving the links in Tika help in this case? To crawl
> the site, the crawler would in any case have to resolve the links, and
> come up with the exact same resolved URLs.
>
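
To make the loop above concrete, here is a minimal sketch in Python. It is not Tika or crawler code: resolve_chain and is_relative are hypothetical helper names, and only the standard urllib.parse module is assumed. It shows how the same relative href keeps resolving to a deeper URL on every hop, and that the relative/absolute distinction can only be read from the raw href, not from the resolved URL.

```python
from urllib.parse import urljoin, urlsplit


def resolve_chain(start, href, depth):
    """Resolve the same relative href against each newly resolved page URL."""
    urls = []
    current = start
    for _ in range(depth):
        # Each page serves the same link list, so the href is resolved
        # against the URL produced by the previous hop.
        current = urljoin(current, href)
        urls.append(current)
    return urls


def is_relative(href):
    """Hypothetical check on the raw href: no scheme and no host means relative."""
    parts = urlsplit(href)
    return not parts.scheme and not parts.netloc


chain = resolve_chain("http://example.org/content/", "wrong-link/", 3)
for url in chain:
    print(url)
```

After resolution every entry in chain is an ordinary absolute URL, so a check like is_relative only works if the unresolved href is still available at that point.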

I could choose not to collect those relative URLs as outlinks. Right now I
cannot determine whether a URL was originally a relative URL.

> BR,
>
> Jukka Zitting

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
