[ https://issues.apache.org/jira/browse/TIKA-379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12834228#action_12834228 ]

Ken Krugler commented on TIKA-379:
----------------------------------

I think this is part of a bigger issue re attributes getting stripped. E.g. <a rel="nofollow"> is important for web crawlers.

Since the language attribute can be applied to a variety of tags, I don't think it's an option to just store it in the metadata.

> Lang attribute on html tag skipped
> -----------------------------------
>
>                 Key: TIKA-379
>                 URL: https://issues.apache.org/jira/browse/TIKA-379
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Julien Nioche
>
> The following HTML document:
> <html lang="fi"><head>document 1 title</head><body>jotain suomeksi</body></html>
> is rendered as the following xhtml by Tika:
> <?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml"><head><title/></head><body>document 1 titlejotain suomeksi</body></html>
> with the lang attribute getting lost. The lang is not stored in the metadata either.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
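
To illustrate the point about lang applying to a variety of tags, here is a minimal sketch that collects every lang attribute together with the tag it appears on. It is not Tika code and does not use Tika's API; it uses Python's standard html.parser module purely for illustration. The names LangCollector and collect_langs, and the extra <span lang="en"> added to the sample document, are hypothetical and not part of the original report.

```python
from html.parser import HTMLParser


class LangCollector(HTMLParser):
    """Record every (tag, lang) pair seen in start tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.langs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; html.parser lowercases names.
        for name, value in attrs:
            if name == "lang":
                self.langs.append((tag, value))


def collect_langs(html):
    """Return the (tag, lang) pairs found in the given HTML string."""
    parser = LangCollector()
    parser.feed(html)
    parser.close()
    return parser.langs


# Variant of the document from the report, with an assumed nested
# <span lang="en"> added to show a second language inside the body.
doc = ('<html lang="fi"><head><title>document 1 title</title></head>'
       '<body>jotain suomeksi <span lang="en">and some English</span>'
       '</body></html>')

print(collect_langs(doc))
```

With more than one lang value in a document, a single metadata entry could only hold one of them, which is why keeping the attribute on the element in the output seems like the safer choice.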