Hello,

We parse HTML using a ContentHandler. Tika uses TagSoup, which does not support modern HTML, but we work around the problem by modifying its HTMLSchema. We now have access to HTML5 elements, and to other curiosities such as allowing META anywhere in the body.
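For context, here is a minimal sketch of the kind of handler we mean. It is not our actual code: it uses the JDK's built-in SAX parser on well-formed XHTML rather than Tika or TagSoup, so that it runs on its own, and the class name `AttributeLoggingHandler` and the sample markup are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: records every attribute reported via startElement.
public class AttributeLoggingHandler extends DefaultHandler {
    final List<String> seen = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Attributes arrive together with the start tag of the element.
        for (int i = 0; i < atts.getLength(); i++) {
            seen.add(qName + "@" + atts.getQName(i) + "=" + atts.getValue(i));
        }
    }

    // Parses a well-formed XHTML string with the JDK SAX parser
    // (an assumption for this sketch; Tika/TagSoup is not involved here).
    static List<String> collect(String xhtml) throws Exception {
        AttributeLoggingHandler handler = new AttributeLoggingHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.seen;
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html lang=\"en\"><body><p class=\"x\">hi</p></body></html>";
        collect(xhtml).forEach(System.out::println);
    }
}
```

With a plain SAX parser the attributes on the root element show up in `startElement` like those of any other element, which is the behaviour we would expect to see through Tika as well.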
What we never managed to get working is reading the attributes of the HTML element. Does anyone have ideas on how to get those attributes reported consistently?

Many thanks,
Markus
