Hi,

On Monday 12 September 2011 17:35:49 Jukka Zitting wrote:
> Hi,
>
> On Mon, Sep 12, 2011 at 4:58 PM, Markus Jelsma
> <[email protected]> wrote:
> > Since TIKA-287 all relative URL's are resolved to absolutes regardless of
> > the presence of the base element. This is not always desired behaviour.
>
> Can you describe a use case where that's not the desired behaviour? I
> would assume that a resolved URL is always preferred to an unresolved
> one.

Yes! Nutch extracts all outlinks, but there is a tedious crawler trap
involving self-referring relative URLs. Consider http://example.org/content/
with a list of relative links (a menu on each page), of which one or more is
actually incorrect:

../more-content/
../other-content/
wrong-link/
../even-more/content/

For pages without a base href, wrong-link/ is resolved to
http://example.org/content/wrong-link/. The new page also contains the same
URL list as above, so the next wrong link is resolved as
http://example.org/content/wrong-link/wrong-link/...... An endless nightmare
for a crawler :)

> > Would it be possible to use some setting to instruct the parser not to
> > resolve URL's if the base element doesn't exist or does not have an href
> > attribute with a valid absolute URL?
>
> Currently Tika looks at the CONTENT_LOCATION and RESOURCE_NAME_KEY
> metadata keys for the default base URL. If neither is present and
> there is no <base href=".."> element, then URLs in the document will
> not be resolved.

Hm, testing with Nutch I see that URLs are always extracted. It seems at
least one metadata key is present, although I'm not too sure. In the Nutch
code an empty org.apache.tika.metadata.Metadata object is passed to the
parse() method.

> BR,
>
> Jukka Zitting

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
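
As a rough illustration of the crawler trap described above, here is a minimal sketch using only java.net.URI from the JDK. It is not Nutch or Tika code: the class name CrawlerTrapDemo and the method follow() are made up for this example, and it simply assumes that every fetched page repeats the same broken relative link and has no base href.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class, not part of Nutch or Tika.
class CrawlerTrapDemo {

    // Resolve the same relative link against each newly "fetched" page,
    // as a crawler would if every page carried the same broken menu entry.
    static List<String> follow(String startPage, String relativeLink, int hops) {
        List<String> visited = new ArrayList<>();
        URI current = URI.create(startPage);
        for (int i = 0; i < hops; i++) {
            // Without a base href, the page's own URL is the base.
            current = current.resolve(relativeLink);
            visited.add(current.toString());
        }
        return visited;
    }

    public static void main(String[] args) {
        // The broken entry grows by one path segment per hop.
        for (String url : follow("http://example.org/content/", "wrong-link/", 3)) {
            System.out.println(url);
        }
        // A correct entry from the start page resolves to a stable URL.
        System.out.println(
            URI.create("http://example.org/content/").resolve("../more-content/"));
    }
}
```

Each hop yields a new, never-seen URL, so plain URL deduplication in the crawler does not stop the loop; that is why leaving such links unresolved (or bounding them some other way) can be preferable.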
