[ 
https://issues.apache.org/jira/browse/TIKA-379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12856839#action_12856839
 ] 

Jukka Zitting commented on TIKA-379:
------------------------------------

The default HTML mapping rules in Tika exist to simplify and 
normalize input documents so that client applications can easily process 
all sorts of input (HTML or not) without needing type- or source-specific 
heuristics. The basic idea has been that clients should use the 
underlying parser libraries directly when they need custom processing of 
specific content types.

That said, I see the value of being able to process even complex HTML input 
through the Tika API, and perhaps the above original intent is too strict for 
many use cases. The HtmlMapper interface we added for TIKA-347 should make it 
possible to relax the mapping rules, and in revision 933909 I added an 
IdentityHtmlMapper implementation of this interface to make it even easier to 
use:

    ParseContext context = new ParseContext();
    context.set(HtmlMapper.class, IdentityHtmlMapper.INSTANCE);

Note that IdentityHtmlMapper breaks the guarantee that the Tika output is valid 
XHTML. Also, the HtmlMapper interface currently covers only elements, so all 
attributes are still lost. In addition, IdentityHtmlMapper overrides the custom 
<a/> tag handling in HtmlHandler, so even the href attributes are gone. It would 
be good if we could extend the HtmlMapper mechanism to avoid these problems.
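
To make the difference between the two mapping styles concrete, here is a 
minimal self-contained sketch. It does not use Tika at all: ElementMapper, 
NORMALIZING, IDENTITY, the SAFE whitelist and mapStartTags are hypothetical 
names invented for illustration, and Tika's real default rules differ. The only 
point is that a mapper which sees element names alone cannot preserve 
attributes, whichever mapping policy it applies.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class MapperSketch {

    /** Hypothetical stand-in for an element mapper; NOT Tika's HtmlMapper. */
    interface ElementMapper {
        /** Returns the output element name, or null to drop the tag. */
        String mapElement(String name);
    }

    // Assumed whitelist for illustration only; Tika's real default rules differ.
    static final Set<String> SAFE =
            new HashSet<>(Arrays.asList("p", "h1", "ul", "li", "table"));

    /** Normalizing mapper: lower-cases names and keeps only whitelisted ones. */
    static final ElementMapper NORMALIZING = name -> {
        String lower = name.toLowerCase(Locale.ROOT);
        return SAFE.contains(lower) ? lower : null;
    };

    /** Identity-style mapper: passes every element through (lower-cased). */
    static final ElementMapper IDENTITY = name -> name.toLowerCase(Locale.ROOT);

    /**
     * Maps start tags given as "name attr=value ..." strings. Only the name
     * is handed to the mapper, so attributes are lost with either mapper.
     */
    static List<String> mapStartTags(ElementMapper mapper, List<String> tags) {
        List<String> out = new ArrayList<>();
        for (String tag : tags) {
            String name = tag.trim().split("\\s+")[0];
            String mapped = mapper.mapElement(name);
            if (mapped != null) {
                out.add("<" + mapped + ">");
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("DIV class=nav", "a href=/x", "P");
        System.out.println(mapStartTags(NORMALIZING, input));
        System.out.println(mapStartTags(IDENTITY, input));
    }
}
```

With the assumed whitelist, the normalizing mapper keeps only the <p> tag, 
while the identity-style mapper keeps <div>, <a> and <p>. In both cases the 
class and href attributes are gone, which is the limitation described above.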

> Html elements and attributes not available in XHTML representation 
> -------------------------------------------------------------------
>
>                 Key: TIKA-379
>                 URL: https://issues.apache.org/jira/browse/TIKA-379
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.7
>            Reporter: Julien Nioche
>            Priority: Critical
>
> The following HTML document :
> <html lang="fi"><head>document 1 title</head><body>jotain 
> suomeksi</body></html>
> is rendered as the following xhtml by Tika :
> <?xml version="1.0" encoding="UTF-8"?><html 
> xmlns="http://www.w3.org/1999/xhtml"><head><title/></head><body>document 1 
> titlejotain suomeksi</body></html>
> with the lang attribute getting lost. The lang is not stored in the metadata 
> either.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
https://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
