Hello,

We parse HTML using a ContentHandler. Tika uses TagSoup, which does not support
modern HTML, but we work around the problem by tweaking its HTMLSchema. This
gives us access to HTML5 elements, plus other curiosities such as allowing META
anywhere in the body.

What we never managed to get working is reading the attributes of the HTML
element. Does anyone have ideas on how to get those attributes reported to the
handler every time?
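
To illustrate the goal, here is a minimal sketch of a handler that collects the
attributes of the HTML element. It assumes "the HTML element" means the root
<html> tag. The class name HtmlAttributeCapture and its parse helper are made up
for this example, and the handler is driven by the JDK's own SAX parser on
well-formed input, not by Tika or TagSoup, so it only shows what we expect to
see in startElement.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler for illustration only; not part of Tika or TagSoup.
public class HtmlAttributeCapture extends DefaultHandler {
    // Attributes seen on the <html> element, in the order the parser reports them.
    private final Map<String, String> htmlAttributes = new LinkedHashMap<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // A parser that is not namespace-aware may pass an empty localName,
        // so fall back to the qualified name.
        String name = (localName == null || localName.isEmpty()) ? qName : localName;
        if ("html".equalsIgnoreCase(name)) {
            for (int i = 0; i < atts.getLength(); i++) {
                htmlAttributes.put(atts.getQName(i), atts.getValue(i));
            }
        }
    }

    public Map<String, String> getHtmlAttributes() {
        return htmlAttributes;
    }

    // Assumption: input is well-formed XML/XHTML, parsed with the JDK SAX parser.
    public static Map<String, String> parse(String xml) throws Exception {
        HtmlAttributeCapture handler = new HtmlAttributeCapture();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.getHtmlAttributes();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(parse("<html lang=\"en\" class=\"no-js\"><body/></html>"));
    }
}
```

With a plain SAX parser the lang and class attributes show up in the map. What
we are missing is the equivalent behaviour when the same kind of handler is
passed to Tika.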

Many thanks,
Markus
