[ https://issues.apache.org/jira/browse/TIKA-379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12856848#action_12856848 ]
Julien Nioche commented on TIKA-379:
------------------------------------

Thanks for your comments. I had seen the HTMLMapper but, as I pointed out:

{quote}
There is actually a special treatment for the elements in HEAD done in the class HtmlHandler so simply adding *link* to the HTMLMapper does not solve the problem.
{quote}

I will send a patch later today which modifies the HTMLMapper to make it generate LINK elements in the XHTML output. This is a reasonable thing to do, as this element is allowed in the XHTML DTD. I will look at the HTMLMapper later to see how we could get it to keep the href attributes.

> Html elements and attributes not available in XHTML representation
> -------------------------------------------------------------------
>
>                 Key: TIKA-379
>                 URL: https://issues.apache.org/jira/browse/TIKA-379
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 0.7
>            Reporter: Julien Nioche
>            Priority: Critical
>
> The following HTML document:
> <html lang="fi"><head>document 1 title</head><body>jotain suomeksi</body></html>
> is rendered as the following XHTML by Tika:
> <?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml"><head><title/></head><body>document 1 titlejotain suomeksi</body></html>
> with the lang attribute getting lost. The lang is not stored in the metadata
> either.
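
To illustrate the idea discussed in the comment (letting selected elements such as LINK, and selected attributes such as href or lang, pass through to the XHTML output), here is a minimal, self-contained sketch. It does not use Tika's classes and is not the patch mentioned above: the class name SafeMappingSketch, the methods safeElement, safeAttribute and renderStartTag, and both whitelists are hypothetical and exist only for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only: not Tika's HtmlMapper or HtmlHandler.
public class SafeMappingSketch {

    // Assumed whitelist of elements that may appear in the XHTML output.
    private static final Set<String> SAFE_ELEMENTS =
            Set.of("html", "head", "title", "link", "body", "a");

    // Assumed whitelist of attributes to keep, per element.
    private static final Map<String, Set<String>> SAFE_ATTRIBUTES = Map.of(
            "html", Set.of("lang"),
            "link", Set.of("rel", "href"),
            "a", Set.of("href"));

    // Returns the normalized element name, or null if the element is dropped.
    static String safeElement(String name) {
        String lower = name.toLowerCase(Locale.ROOT);
        return SAFE_ELEMENTS.contains(lower) ? lower : null;
    }

    // Returns the normalized attribute name, or null if the attribute is dropped.
    static String safeAttribute(String elementName, String attributeName) {
        Set<String> allowed = SAFE_ATTRIBUTES.get(elementName.toLowerCase(Locale.ROOT));
        String lower = attributeName.toLowerCase(Locale.ROOT);
        return (allowed != null && allowed.contains(lower)) ? lower : null;
    }

    // Renders a start tag, keeping only whitelisted attributes.
    // Attribute values are not escaped here; a real serializer must escape them.
    static String renderStartTag(String name, Map<String, String> attributes) {
        String element = safeElement(name);
        if (element == null) {
            return "";
        }
        StringBuilder tag = new StringBuilder("<").append(element);
        for (Map.Entry<String, String> entry : attributes.entrySet()) {
            String attribute = safeAttribute(element, entry.getKey());
            if (attribute != null) {
                tag.append(' ').append(attribute)
                   .append("=\"").append(entry.getValue()).append('"');
            }
        }
        return tag.append('>').toString();
    }

    public static void main(String[] args) {
        Map<String, String> attributes = new LinkedHashMap<>();
        attributes.put("LANG", "fi");
        attributes.put("onclick", "x()");
        // The lang attribute survives; onclick is dropped.
        System.out.println(renderStartTag("HTML", attributes));
    }
}
```

With a per-element attribute whitelist like this, the lang attribute from the example document in the issue would be kept on the html element, while attributes outside the whitelist would still be discarded.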