[ https://issues.apache.org/jira/browse/TIKA-287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12765806#action_12765806 ]
Ken Krugler commented on TIKA-287:
----------------------------------

[hmm, where did my comment go? Retyping]

Wish I had time to submit a patch. But the code I used is:

1. Use the incoming CONTENT_LOCATION in the metadata to set up the base URL.
2. Watch for a <base> element in the head, and update the base with the cleaned-up href.
3. When you get an <a> element, use the cleaned-up href in a call to a relative URL resolver.
4. Always trim the href you get, and strip out any CR/LF chars.
5. Attached is an example of the URL resolver code w/tests. It's not formatted properly, and it should use a pattern with case-insensitive matching if you want to pass unnormalized URLs to the routine.

Hope this helps...Ken

> HtmlParser should resolve relative paths in <a href="xxx"> elements
> -------------------------------------------------------------------
>
>                 Key: TIKA-287
>                 URL: https://issues.apache.org/jira/browse/TIKA-287
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 0.4
>            Reporter: Ken Krugler
>            Assignee: Jukka Zitting
>         Attachments: UrlUtils.java, UrlUtilsTest.java
>
>
> Currently clients of the HtmlParser need to manually keep track of the
> appropriate base URL to use when resolving relative URLs in href="xxx"
> attributes.
> The parser should use the metadata RESOURCE_NAME_KEY value as the base.
> The parser should also watch for a <base> element in the <head> section, and
> use that to update the base URL.
> Note that special care must be taken to work around a known bug in the Java
> URL() class, when the relative URL is a query string and the base URL doesn't
> end with a '/'.
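
For anyone reading along, steps 2 through 4 of the comment and the query-string caveat from the issue description could be sketched roughly as below. This is not the attached UrlUtils.java; the class name HrefResolver and its methods are hypothetical, and the query-only branch is just one possible way to keep the last path segment of the base.

```java
import java.net.MalformedURLException;
import java.net.URL;

/** Hypothetical helper for illustration; not the UrlUtils.java attached to the issue. */
final class HrefResolver {

    private HrefResolver() {}

    /** Step 4: remove CR/LF characters anywhere in the value, then trim whitespace. */
    static String cleanHref(String href) {
        return href.replaceAll("[\\r\\n]", "").trim();
    }

    /**
     * Steps 2 and 3: resolve a raw href against the current base URL.
     * Note: the java.net.URL constructors are deprecated as of Java 20 but still work.
     */
    static String resolve(String baseUrl, String rawHref) {
        String href = cleanHref(rawHref);
        try {
            URL base = new URL(baseUrl);
            if (href.startsWith("?")) {
                // Query-only reference: build the result from the base path
                // (which excludes any old query string) plus the new query,
                // so the last path segment of the base is never dropped.
                return new URL(base.getProtocol(), base.getHost(), base.getPort(),
                        base.getPath() + href).toExternalForm();
            }
            // Everything else goes through the normal two-argument resolver.
            return new URL(base, href).toExternalForm();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Cannot resolve '" + href
                    + "' against '" + baseUrl + "'", e);
        }
    }
}
```

With this sketch, HrefResolver.resolve("http://example.com/dir/page.html", "?id=1") returns "http://example.com/dir/page.html?id=1". A <base> element could be handled the same way, by replacing the current base with HrefResolver.resolve(currentBase, baseHref) before resolving any later <a> elements.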