Hi,

On Monday 12 September 2011 17:35:49 Jukka Zitting wrote:
> Hi,
> 
> On Mon, Sep 12, 2011 at 4:58 PM, Markus Jelsma <[email protected]> wrote:
> > Since TIKA-287 all relative URL's are resolved to absolutes regardless of
> > the presence of the base element. This is not always desired behaviour.
> 
> Can you describe a use case where that's not the desired behaviour? I
> would assume that a resolved URL is always preferred to an unresolved
> one.

Yes! Nutch extracts all outlinks, but there is a tedious crawler trap involving
self-referring relative URLs. Consider http://example.org/content/ with a
list of relative links (a menu on each page), one or more of which is actually
incorrect:

../more-content/
../other-content/
wrong-link/
../even-more/content/

For pages without a base href, wrong-link/ is resolved to
http://example.org/content/wrong-link/. The new page also contains the same
URL list as above, so the next wrong link is resolved as
http://example.org/content/wrong-link/wrong-link/......

An endless nightmare for a crawler :)
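
To make the trap concrete, here is a minimal sketch in plain Java. It uses only
java.net.URI, not Nutch or Tika; the class and method names are made up for
illustration. It resolves the same broken relative link against each newly
discovered page URL, which is what happens when every page carries the same menu.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class, not part of Nutch or Tika.
public class CrawlerTrapDemo {

    // Follow the same broken relative link for a number of hops, each time
    // resolving it against the URL of the page it was found on.
    static List<String> followBrokenLink(String start, String brokenLink, int hops) {
        List<String> visited = new ArrayList<>();
        URI current = URI.create(start);
        for (int i = 0; i < hops; i++) {
            current = current.resolve(brokenLink);
            visited.add(current.toString());
        }
        return visited;
    }

    public static void main(String[] args) {
        // Each hop appends another wrong-link/ segment to the path.
        for (String url : followBrokenLink("http://example.org/content/", "wrong-link/", 3)) {
            System.out.println(url);
        }
    }
}
```

Every hop yields a URL the crawler has not seen before, so de-duplication by
URL alone does not stop it.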

> 
> > Would it be possible to use some setting to instruct the parser not to
> > resolve URL's if the base element doesn't exist or does not have an href
> > attribute with a valid absolute URL?
> 
> Currently Tika looks at the CONTENT_LOCATION and RESOURCE_NAME_KEY
> metadata keys for the default base URL. If neither is present and
> there is no <base href=".."> element, then URLs in the document will
> not be resolved.

Hm, testing with Nutch I see that URLs are always extracted. It seems at least
one metadata key is present, although I'm not too sure. In the Nutch code an
empty org.apache.tika.metadata.Metadata object is passed to the parse()
method.
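
For reference, the fallback described above could be sketched roughly like this.
It is plain Java, not Tika's actual code: the map stands in for the Metadata
object, the key strings are assumed stand-ins for CONTENT_LOCATION and
RESOURCE_NAME_KEY, and the lookup order between the two keys is also an
assumption.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the described rule; not taken from Tika.
public class BaseUrlSketch {

    // Assumed key names, used here only for illustration.
    static final String CONTENT_LOCATION = "Content-Location";
    static final String RESOURCE_NAME_KEY = "resourceName";

    // Choose a base URL: an explicit <base href> first, then the two
    // metadata keys. Returns null when nothing is available.
    static String pickBase(String baseHref, Map<String, String> metadata) {
        if (baseHref != null) {
            return baseHref;
        }
        String location = metadata.get(CONTENT_LOCATION);
        if (location != null) {
            return location;
        }
        return metadata.get(RESOURCE_NAME_KEY);
    }

    // Resolve the link only when a base is known; otherwise leave it untouched.
    static String resolve(String link, String baseHref, Map<String, String> metadata) {
        String base = pickBase(baseHref, metadata);
        if (base == null) {
            return link;
        }
        return URI.create(base).resolve(link).toString();
    }

    public static void main(String[] args) {
        Map<String, String> empty = new HashMap<>();
        // With empty metadata and no base element the link should stay relative.
        System.out.println(resolve("wrong-link/", null, empty));
    }
}
```

If this matches the real behaviour, an empty Metadata object and no base element
should leave the links relative, so something on the Nutch side may be filling
in one of the keys before the parser reads them.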

> 
> BR,
> 
> Jukka Zitting

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
