Hi,

On Mon, Sep 19, 2011 at 10:56 PM, Markus Jelsma
<[email protected]> wrote:
> There are now several cases known to us where we would like to control URL
> resolving. All cases share one similarity, URL's being relative in the
> original source. How could we instruct the parser or modify the code to do so?

I guess we could make the URL resolution mechanism pluggable.

But I still don't see how you'd resolve relative URLs other than the
way it's now done in Tika's HtmlHandler.resolve() method.
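
To illustrate what I mean by standard resolution, here is a minimal
sketch using plain java.net.URI. It is not Tika's actual code; the
class name ResolveSketch and the example.com URLs are made up for
illustration, and the second case is only my guess at what a
"recursive" URL might look like.

```java
import java.net.URI;

class ResolveSketch {
    // Resolve a (possibly relative) link against the URL of the page
    // it was found on, following the standard URI resolution rules.
    static String resolve(String base, String link) {
        return URI.create(base).resolve(link).toString();
    }

    public static void main(String[] args) {
        // Ordinary relative link with a parent-directory step.
        System.out.println(resolve(
                "http://example.com/a/b/index.html", "../c/page.html"));
        // prints http://example.com/a/c/page.html

        // Hypothetical "recursive" case: a page that links to itself
        // with a path that lacks the leading slash. Each resolution is
        // correct, yet the result grows by one segment per hop.
        System.out.println(resolve("http://example.com/foo/bar", "foo/bar"));
        // prints http://example.com/foo/foo/bar
    }
}
```

Both results are what the resolution rules prescribe, which is why I
don't think the fix belongs at this level.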

Generally speaking, problems like the recursive URL you mentioned
should be avoided above the level of URL resolution. For example, your
crawler would face the exact same problem when encountering, say, a
dynamic calendar web site with links to the next or previous day. Such
an infinite URL space is perfectly valid, so no resolution mechanism
could prevent the crawler from entering such a trap. Instead, the
crawler should employ heuristics such as a maximum recursion depth to
avoid such problems.
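
As a rough sketch of the depth heuristic, assuming a simple
breadth-first crawler: the class DepthLimitedCrawl, the crawl() method
and the linksOf callback are all hypothetical names for illustration,
not part of Tika or of any particular crawler.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

class DepthLimitedCrawl {
    // Breadth-first traversal that never follows links from a page
    // that is already maxDepth hops away from the seed. linksOf stands
    // in for "fetch the page, parse it, return its resolved links".
    static Set<String> crawl(String seed, int maxDepth,
                             Function<String, List<String>> linksOf) {
        // Maps each discovered URL to its link distance from the seed;
        // doubles as the "already seen" set.
        Map<String, Integer> depthOf = new LinkedHashMap<>();
        Deque<String> queue = new ArrayDeque<>();
        depthOf.put(seed, 0);
        queue.add(seed);
        while (!queue.isEmpty()) {
            String url = queue.poll();
            int depth = depthOf.get(url);
            if (depth >= maxDepth) {
                continue; // depth limit reached: do not expand further
            }
            for (String link : linksOf.apply(url)) {
                if (!depthOf.containsKey(link)) {
                    depthOf.put(link, depth + 1);
                    queue.add(link);
                }
            }
        }
        return depthOf.keySet();
    }
}
```

With a calendar-like site where every page links to the previous and
the next day, a limit of 3 stops the crawl after the seed plus three
days in each direction, however far the calendar itself extends. A
real crawler would probably combine this with other limits, such as a
cap on the number of URLs per host.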

BR,

Jukka Zitting
