[ 
https://issues.apache.org/jira/browse/TIKA-287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12760027#action_12760027
 ] 

Uwe Schindler commented on TIKA-287:
------------------------------------

This is a nice addition. Parsing the base tag is not complicated: just add a 
check inside the startElement() handler of the HTML parser. I have this in my 
own (non-Tika-related) NekoHTML content handlers, too (see the private 
analyzeHTML method in 
[http://panfmp.svn.sourceforge.net/viewvc/panfmp/main/trunk/src/de/pangaea/metadataportal/harvester/WebCrawlingHarvester.java?view=markup]).
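
The startElement() check could look roughly like the sketch below. This is 
not the panFMP or Tika code: the class name BaseAwareLinkHandler and its 
getLinks() accessor are made up for illustration. It assumes element names 
may arrive in either case and that href values are syntactically valid URIs.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical SAX handler: tracks the document base URL and resolves
// <a href="..."> values against it.
class BaseAwareLinkHandler extends DefaultHandler {
    private URI base;                 // current base URL for resolution
    private boolean baseSeen = false; // honor only the first <base href>
    private final List<String> links = new ArrayList<>();

    // initialBase would typically be the URL the document was fetched from.
    BaseAwareLinkHandler(String initialBase) {
        this.base = URI.create(initialBase);
    }

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) {
        // Fall back to qName when the parser is not namespace-aware.
        String name = (localName == null || localName.isEmpty())
                ? qName : localName;
        String href = atts.getValue("href");
        if (href == null) {
            return;
        }
        if ("base".equalsIgnoreCase(name)) {
            if (!baseSeen) {
                // A <base> href may itself be relative, so resolve it too.
                base = base.resolve(href.trim());
                baseSeen = true;
            }
        } else if ("a".equalsIgnoreCase(name)) {
            // URI.resolve throws IllegalArgumentException on invalid
            // input; real code would need to catch that.
            links.add(base.resolve(href.trim()).toString());
        }
    }

    List<String> getLinks() {
        return links;
    }
}
```

Links that appear before any <base> element are resolved against the 
initial URL; later ones use the updated base.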

> HtmlParser should resolve relative paths in <a href="xxx"> elements
> -------------------------------------------------------------------
>
>                 Key: TIKA-287
>                 URL: https://issues.apache.org/jira/browse/TIKA-287
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 0.4
>            Reporter: Ken Krugler
>
> Currently clients of the HtmlParser need to manually keep track of the 
> appropriate base URL to use when resolving relative URLs in href="xxx" 
> attributes.
> The parser should use the metadata RESOURCE_NAME_KEY value as the base.
> The parser should also watch for a <base> element in the <head> section, and 
> use that to update the base URL.
> Note that special care must be taken to work around a known bug in the Java 
> URL() class, when the relative URL is a query string and the base URL doesn't 
> end with a '/'.
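
One way to sidestep the query-string case mentioned in the description is to 
handle it before handing anything to java.net.URL. The sketch below is only 
an illustration: the UrlResolver class and its resolve() method are 
hypothetical names, and it assumes an http-style base URL that has an 
authority component.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical helper, not part of Tika.
final class UrlResolver {
    private UrlResolver() {
    }

    static URL resolve(URL base, String relative)
            throws MalformedURLException {
        String rel = relative.trim();
        if (rel.startsWith("?")) {
            // Query-only reference: keep the scheme, authority and full
            // path of the base, and replace only its query part.
            return new URL(base.getProtocol() + "://"
                    + base.getAuthority() + base.getPath() + rel);
        }
        // All other references go through the normal two-argument
        // constructor.
        return new URL(base, rel);
    }
}
```

With a base of http://example.com/path/page.html, a relative URL of "?q=1" 
keeps the page.html segment, while other relative URLs are resolved as 
usual.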

